Solved – What are the pros and cons of using median imputation to handle missing values?

data-imputation, mean, median, missing-data

I have to choose between median and mean imputation to handle missing values. I feel median imputation will work better because it is a number that is already present in the data set and is less susceptible to outliers than the mean.

What might be the disadvantages of median imputation though?

Best Answer

Neither of these is appropriate for imputing missing data, because filling every gap with the same constant understates the variability in the data. Consider the case of heteroskedasticity: neither approach would work if there were 'weird' or idiosyncratic values in your data. In fact, mean or median replacement would be more damaging (i.e., less accurate) in this case.
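
To make the point concrete, here is a minimal sketch in plain Python (the numbers are made up for illustration, and `impute` is a hypothetical helper, not part of any package). It fills the gaps with the mean and with the median, then compares the standard deviation before and after.

```python
import statistics

# Hypothetical sample: one extreme value (250) and two missing entries (None).
values = [12, 15, 14, None, 13, 250, None, 16]
observed = [v for v in values if v is not None]

def impute(data, fill):
    """Replace every None with one constant fill value."""
    return [fill if v is None else v for v in data]

mean_fill = statistics.mean(observed)      # dragged upward by the outlier
median_fill = statistics.median(observed)  # stays with the bulk of the data

mean_imputed = impute(values, mean_fill)
median_imputed = impute(values, median_fill)

# Spread of the data before and after filling the gaps.
sd_observed = statistics.stdev(observed)
sd_mean_imputed = statistics.stdev(mean_imputed)
sd_median_imputed = statistics.stdev(median_imputed)

print(mean_fill, median_fill)
print(sd_observed, sd_mean_imputed, sd_median_imputed)
```

In this toy sample the median fill is 14.5, which is not itself an observed value because the number of observed points is even. The median does resist the outlier, but both completed data sets have a smaller standard deviation than the observed values alone, which is the distortion described above.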

If you're familiar with R, you could check out the mi package (my favorite) or mice. These essentially run a series of chained (often Bayesian) regressions on the data, cycling through the variables until some convergence criterion is met.
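
As a rough illustration of the chained-regression idea, here is a deliberately simplified sketch in plain Python for two variables. It is not the algorithm or API of mi or mice: those packages draw imputations with random noise and produce several completed data sets, whereas this version is deterministic and returns a single one. The names `fit_line`, `refill`, and `chained_impute` are hypothetical, and the sketch assumes no row is missing both values.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx  # assumes the predictor is not constant
    return my - b * mx, b

def refill(target, predictor, observed_rows, missing_rows):
    """Fit target ~ predictor on rows where target was observed, then
    overwrite the missing target entries. Returns the largest change."""
    a, b = fit_line([predictor[i] for i in observed_rows],
                    [target[i] for i in observed_rows])
    change = 0.0
    for i in missing_rows:
        new = a + b * predictor[i]
        change = max(change, abs(new - target[i]))
        target[i] = new
    return change

def chained_impute(x, y, max_iter=50, tol=1e-8):
    """Alternate x ~ y and y ~ x regressions until the fills stop moving."""
    n = len(x)
    obs_x = [i for i in range(n) if x[i] is not None]
    obs_y = [i for i in range(n) if y[i] is not None]
    miss_x = [i for i in range(n) if x[i] is None]
    miss_y = [i for i in range(n) if y[i] is None]
    # Start from mean fills, then refine them.
    x_start = statistics.fmean([x[i] for i in obs_x])
    y_start = statistics.fmean([y[i] for i in obs_y])
    xf = [x_start if v is None else float(v) for v in x]
    yf = [y_start if v is None else float(v) for v in y]
    for _ in range(max_iter):
        change = max(refill(xf, yf, obs_x, miss_x),
                     refill(yf, xf, obs_y, miss_y))
        if change < tol:  # convergence criterion
            break
    return xf, yf

# Made-up data lying exactly on y = 2x + 1, with one gap in each variable.
x = [1, 2, None, 4, 5, 6]
y = [3, 5, 7, 9, None, 13]
x_filled, y_filled = chained_impute(x, y)
print(x_filled[2], y_filled[4])
```

With this toy data the filled entries move away from the initial mean fills and toward the values implied by the relationship between the two variables (3 and 11), which a single mean or median fill cannot do.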

Other options are expectation maximization (prone to overfitting problems, in my opinion) and hot-deck imputation.
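
For a feel of what hot-deck imputation does, here is a minimal sketch in plain Python, assuming a random hot deck within groups: each missing value is copied from a randomly chosen donor record in the same group. The function `hot_deck`, the field names, and the data are all hypothetical.

```python
import random

def hot_deck(records, key, target, seed=0):
    """Fill missing `target` values with the value of a random donor
    that shares the same `key`. Returns new records; input is untouched."""
    rng = random.Random(seed)
    # Collect observed target values per group as the donor pools.
    donors = {}
    for r in records:
        if r[target] is not None:
            donors.setdefault(r[key], []).append(r[target])
    completed = []
    for r in records:
        filled = dict(r)
        if filled[target] is None:
            pool = donors.get(filled[key])
            if pool:  # leave the gap if the group has no donors
                filled[target] = rng.choice(pool)
        completed.append(filled)
    return completed

records = [
    {"region": "north", "income": 40},
    {"region": "north", "income": 44},
    {"region": "north", "income": None},
    {"region": "south", "income": 70},
    {"region": "south", "income": None},
]
completed = hot_deck(records, "region", "income")
print(completed)
```

Unlike a mean fill, every imputed value here is one that actually occurs in the data, and grouping by a related variable keeps the donors reasonably similar to the record being filled.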

Check out these resources for more explanation of why mean/median replacement is generally a bad idea:

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473–489.

Schafer, J. L. (1999). Multiple imputation: a primer. Statistical Methods in Medical Research, 8:3–15.