Handling NA Values in R – Can NAs Be Replaced Based on Response Variable?

Tags: data-transformation, data-imputation, missing-data, r

My data consist of one response variable, 'Age', and one feature (beta). The feature contains some missing values (NA), which I want to replace. I have been replacing them with the median of the feature. However, when I plot the results, I get the feeling I am excessively butchering my data: replacing by the median does not seem fair, as it appears to create outliers.
To improve on this, I now take, for each NA, the mean of the 10 samples closest in Age. The replacement then looks much more natural (perhaps too good).
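For concreteness, the "mean of the 10 closest samples in Age" replacement could be sketched as below. This is a minimal illustration, assuming a data frame `df` with columns `Age` and `beta`; the function name and column names are illustrative, not from the original post.

```r
# Replace each NA in beta with the mean of beta over the k rows whose
# Age is closest to that row's Age (ties broken by row order).
impute_knn_age <- function(df, k = 10) {
  na_idx <- which(is.na(df$beta))
  obs <- which(!is.na(df$beta))          # rows with an observed beta
  for (i in na_idx) {
    d <- abs(df$Age[obs] - df$Age[i])    # distance in Age to observed rows
    nearest <- obs[order(d)][seq_len(min(k, length(obs)))]
    df$beta[i] <- mean(df$beta[nearest])
  }
  df
}

# Example: with k = 2, the NA at Age 3 is filled with mean(2, 4) = 3.
df <- data.frame(Age = 1:5, beta = c(1, 2, NA, 4, 5))
impute_knn_age(df, k = 2)
```

Note that this is still single imputation, so it shares the uncertainty problem discussed in the answer below.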

In red: the replaced NAs

Is it correct to do such a replacement? Are there other alternatives to mean or median NA replacement?

Best Answer

In short, you should look at multiple imputation (i.e., replacement) techniques, first put forward by Rubin in 1987 [1].

In more detail: replacing by a single value assumes certainty about the replaced value and may ignore any selective loss of information (and therefore introduce bias!). Furthermore, you should think about how your data became missing. In general, three 'mechanisms' explain missingness:

  - Missing completely at random (MCAR): roughly, the missing value is not related to any known or unknown properties of the unit/individual that was supposed to be measured.
  - Missing at random (MAR): the missing value is related to known (observed) properties of the unit/individual that was supposed to be measured.
  - Missing not at random (MNAR): the missing value is related to unknown (unobserved) properties of the unit/individual that was supposed to be measured.
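The three mechanisms can be made concrete with a toy simulation. Everything below is illustrative: a simulated predictor `x` and outcome `y`, with missingness in `y` generated under each mechanism.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)

# MCAR: every value of y has the same probability of being missing
y_mcar <- ifelse(runif(n) < 0.3, NA, y)

# MAR: the probability of missing y depends on the *observed* x
y_mar <- ifelse(runif(n) < plogis(x), NA, y)

# MNAR: the probability of missing y depends on y itself, which is
# exactly the part we do not get to observe
y_mnar <- ifelse(runif(n) < plogis(y), NA, y)
```

Under MCAR, complete-case analysis is merely inefficient; under MAR, imputation models using `x` can recover the information; under MNAR, no method based on the observed data alone can fully remove the bias.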

These situations (MCAR, MAR, MNAR) are theoretical to the extent that they often occur simultaneously within a dataset, and even for a single missing value. There is an abundance of literature showing how different strategies for handling missing data pan out in different situations [1-5]. Make sure to check whichever is appropriate for your study.

In general (and this is generalizing a lot, sometimes based on opinion), multiple imputation techniques are preferable. These estimate the missing values from the known parts of the data multiple times, creating multiple completed ('imputed') datasets. The intended analysis is then performed in each completed dataset, and the results are pooled according to predefined rules that take into account the uncertainty introduced by replacing missing values with estimates. Finally, this pooled analysis can be interpreted as you would an analysis of a complete dataset.
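The impute–analyse–pool workflow described above maps directly onto three function calls in the mice package [5]. A minimal sketch, using the `nhanes` example data that ships with mice (the dataset and formula are chosen only for illustration):

```r
library(mice)

# 1. Impute: create m = 5 completed datasets by chained equations
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# 2. Analyse: fit the intended model in each completed dataset
fit <- with(imp, lm(bmi ~ age))

# 3. Pool: combine the 5 fits using Rubin's rules, which propagate
#    the between-imputation uncertainty into the standard errors
pooled <- pool(fit)
summary(pooled)
</imports>
```

The pooled standard errors are larger than those from any single imputed dataset, which is exactly the honesty about uncertainty that single-value replacement lacks.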

I have always found Stef van Buuren's mice package in R very good for performing these techniques, especially because he provides excellent background on both the biases caused by missing data and the use of the mice function in R [5, 6].

Do note that there are other ways to implement multiple imputation (see, for example, the Amelia package, which uses an expectation-maximization approach).

References:

  1. Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.
  2. Donders AR, van der Heijden GJ, Stijnen T, et al. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006;59(10):1087-1091.
  3. Li P, Stuart EA, Allison DB. Multiple Imputation: A Flexible Tool for Handling Missing Data. JAMA 2015;314(18):1966-1967.
  4. Groenwold RH, Donders AR, Roes KC, et al. Dealing with missing outcome data in randomized trials and observational studies. Am J Epidemiol 2012;175(3):210-217.
  5. van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. J Stat Softw 2011;45(3):1-67.
  6. http://www.stefvanbuuren.nl/mi/MICE.html