I'm doing a project that involves replacing missing values in a data set (my first time doing this). I am using two methods, replacement by mean and replacement by median, to fill in the missing values. There is not much difference between the two methods in the resulting minimum, median, maximum, mean and standard deviation of the data. Which method is better, and how can I decide between them using the results produced?
Solved – Which is better, replacement by mean and replacement by median
data-imputation, mean, median
Best Answer
It always depends on your data and your task.
If the dataset has large outliers, I would prefer the median. For example: 99% of household incomes are below 100, and 1% are above 500.
On the other hand, if we are working with the wear of clothes that customers bring to a dry cleaner (assuming the dry cleaner's operators fill in this field intuitively), I would fill the missing values with the mean wear.
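To make the income example concrete, here is a minimal sketch using only Python's standard library. The income values are made up for illustration (99 households below 100 and one outlier well above 500), and the variable names are my own, so treat it as a way to inspect your own data rather than a definitive test.

```python
import statistics

# Hypothetical incomes: 99 households below 100, plus one large outlier.
observed = [40 + (i % 50) for i in range(99)] + [5000]

# Candidate values that would be written into the missing cells.
mean_fill = statistics.mean(observed)
median_fill = statistics.median(observed)

# Share of observed households whose income is below each fill value.
below_mean = sum(x < mean_fill for x in observed) / len(observed)
below_median = sum(x < median_fill for x in observed) / len(observed)

print(f"mean fill:   {mean_fill:.2f} (above {below_mean:.0%} of households)")
print(f"median fill: {median_fill:.2f} (above {below_median:.0%} of households)")
```

With these made-up numbers, the mean is pulled above 99% of the observed households by the single outlier, while the median stays in the middle of the data. If the two fill values are close on your data, as in the question, the choice matters much less.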
It is better to start by understanding your data; after that, this article will be a helpful starting point.