Solved – Which is better, replacement by mean or replacement by median?

data-imputation, mean, median

I'm doing a project that involves replacing missing values in a data set (it is my first time doing this). I am using two methods to fill in the missing values: replacement by mean and replacement by median. The minimum, median, maximum, mean and standard deviation of the data differ very little between the two methods. Which method is better, and how can I use the results produced to decide between them?

Best Answer

It always depends on your data and your task.

If the dataset has large outliers, I prefer the median. For example: 99% of household incomes are below 100, and 1% are above 500.
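To make the outlier case concrete, here is a minimal sketch in Python. The income values and the `impute` helper are made up for illustration; they are not from the asker's data. It only shows how a single extreme value pulls the mean away from the bulk of the data while the median stays put.

```python
import statistics

def impute(values, fill):
    """Return a copy of values with every None replaced by fill (hypothetical helper)."""
    return [fill if v is None else v for v in values]

# Made-up incomes: 99 values below 100 plus one extreme value.
observed = list(range(1, 100)) + [10000]
data = observed + [None, None]  # two missing entries to fill

mean_fill = statistics.mean(observed)      # pulled upward by the single outlier
median_fill = statistics.median(observed)  # stays in the middle of the bulk

by_mean = impute(data, mean_fill)
by_median = impute(data, median_fill)

# Share of observed values that lie below the mean fill value.
share_below_mean = sum(v < mean_fill for v in observed) / len(observed)
print(mean_fill, median_fill, share_below_mean)
```

In this toy data the mean fill value is larger than 99% of the observed incomes, so it is not a typical income, whereas the median fill value sits among the ordinary households.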

On the other hand, if we are working with the wear of clothes that customers bring to a dry-cleaner (assuming that the dry-cleaner's operators fill in this field intuitively), I would fill the missing values with the mean wear.
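As a rough illustration of the case without outliers, the sketch below uses made-up wear scores (percent worn, an assumed scale) that are roughly symmetric. It is only meant to show that mean and median then give nearly the same fill value, which would explain summary statistics that barely differ between the two methods.

```python
import statistics

# Made-up wear scores (percent worn), roughly symmetric, no extreme values.
observed = [20, 30, 30, 40, 40, 40, 50, 50, 60]

mean_fill = statistics.mean(observed)
median_fill = statistics.median(observed)

# Fill three hypothetical missing entries with the mean.
filled = observed + [mean_fill] * 3

# Mean imputation leaves the sample mean unchanged but shrinks the spread,
# because the filled-in values add no variation of their own.
mean_unchanged = statistics.mean(filled) == mean_fill
spread_shrinks = statistics.stdev(filled) < statistics.stdev(observed)
print(mean_fill, median_fill, mean_unchanged, spread_shrinks)
```

If the two fill values are this close on your own data, the choice matters little; if they differ noticeably, that usually points to skew or outliers worth looking at before deciding.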

It is better to start by understanding your data; after that, this article will be a helpful starting point.