Missing data and imputation in general

data-imputation, missing-data

Handling missing data is a bit confusing for me. My questions are:

  1. Is it better to calculate imputations than simply leave out NAs and leave it to the (appropriate) model to handle it?

  2. Is there a common threshold in a column or in the whole dataset below which imputation is generally not feasible? I suppose, if there are only 3 observations in a 100,000-row table, it is nonsense to calculate the missing ones…

  3. Is it better to leave categoricals as such or should one convert them into dummies for better efficiency?

  4. I have learnt that R imputation packages (e.g. Amelia, mice) produce several "imputations" for NAs, but none of them combines these into a single "most probable" set. What can one do with m different imputations?

Best Answer

The following are my thoughts on the subject (as per your questions):

Is it better to calculate imputations than simply leave out NAs and leave it to the (appropriate) model to handle it?

I think the answer is: it depends. Some models (or, more accurately, the software that implements them) handle missing data automatically, either through algorithms that accommodate missing values directly or by embedding multiple imputation or similar methods in the modeling code (usually functions, e.g. in R). Therefore, read the software's documentation carefully to see which missing-data handling features it offers.

Another important point in choosing an approach is determining (that is, testing assumptions about) the missingness mechanism: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). For more details, and for a comprehensive overview of the topic as well as approaches, methods and software for handling missing data, see the excellent paper by Horton and Kleinman (2007).
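To make the three mechanisms concrete, here is a small simulation sketch in plain Python. Everything in it (the variable names, the 30% and 60% missingness rates, the normal distributions) is my own assumption for illustration and is not taken from Horton and Kleinman. It compares the mean of the observed values of `y`, whose true mean is 0, under each mechanism.

```python
import random
import statistics

def observed_mean(mechanism, n=20000, seed=0):
    """Mean of the non-missing y values under a given missingness mechanism."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        x = rng.gauss(0, 1)        # covariate that is always observed
        y = x + rng.gauss(0, 1)    # variable that may go missing; true mean is 0
        if mechanism == "MCAR":
            p_missing = 0.3                      # unrelated to any data
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.0    # depends only on the observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.0    # depends on the unobserved y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        if rng.random() >= p_missing:
            observed.append(y)
    return statistics.mean(observed)

for mech in ("MCAR", "MAR", "MNAR"):
    print(mech, round(observed_mean(mech), 3))
```

In this toy setup, simply dropping the missing rows leaves the mean roughly unbiased under MCAR but pulls it below zero under both MAR and MNAR. The difference is that under MAR the bias can in principle be corrected using the observed `x`, whereas under MNAR it cannot be corrected from the observed data alone.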

Is there a common threshold in a column or in the whole dataset below which imputation is generally not feasible? I suppose, if there are only 3 observations in a 100,000-row table, it is nonsense to calculate the missing ones...

I have not seen any common thresholds for this. Your example is an extreme case and is not representative of most real data sets. Moreover, even a small level of missingness (say, a few percent) in each of many variables can produce substantial overall missingness in the model: "... missingness of just a few percent on each of a number of covariates may lead to a large number of observations with some missing information" (Horton & Kleinman, 2007, p. 79).
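The arithmetic behind that quote can be sketched as follows. The simplifying assumption here (mine, not the paper's) is that each covariate is missing independently with the same probability, so the share of rows with at least one missing value is 1 - (1 - p)^k for k covariates.

```python
def fraction_incomplete(p_missing, n_covariates):
    """Share of rows with at least one missing value, assuming each covariate
    is missing independently with probability p_missing."""
    return 1.0 - (1.0 - p_missing) ** n_covariates

# 5% missingness per covariate, for an increasing number of covariates
for k in (1, 5, 10, 20):
    print(k, round(fraction_incomplete(0.05, k), 3))
```

Under this assumption, 5% missingness on each of 10 covariates already leaves about 40% of the rows incomplete, which is why complete-case analysis can discard far more data than the per-column rates suggest.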

Is it better to leave categoricals as such or should one convert them into dummies for better efficiency?

As far as I know, in most cases it is fine to use categorical variables as they are, assuming, of course, that the software you are using supports that. Most software does support categorical variables directly; see Horton and Kleinman (2007) for details. There may be situations in which converting them would be beneficial, but I am not aware of any.
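For reference, this is roughly what converting a categorical variable into dummies means, sketched in plain Python. The helper `dummy_code` and the `is_` column prefix are hypothetical names of my own, not part of any package. Note that a missing category has to stay missing in every dummy column, which is one reason to keep the variable as a single categorical when the software allows it.

```python
def dummy_code(values, drop_first=True):
    """Turn a list of category labels into rows of 0/1 indicator columns.

    None marks a missing value and is propagated to every indicator.
    With drop_first=True the first level (in sorted order) is the reference.
    """
    levels = sorted({v for v in values if v is not None})
    kept = levels[1:] if drop_first else levels
    rows = []
    for v in values:
        if v is None:
            rows.append({f"is_{lvl}": None for lvl in kept})
        else:
            rows.append({f"is_{lvl}": int(v == lvl) for lvl in kept})
    return rows

colors = ["red", "green", None, "blue"]
for label, row in zip(colors, dummy_code(colors)):
    print(label, row)
```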

I have learnt that R imputation packages (e.g. Amelia, mice) produce several "imputations" for NAs, but none of them combines these into a single "most probable" set. What can one do with m different imputations?

To the best of my knowledge, this is not true. Both Amelia and mice provide functionality for combining the results across imputations and even for performing some types of statistical analysis. An even more integrated process is available in the R-based Zelig software, which supports various statistical models and has built-in support for missing data handling (via the Amelia package).
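The usual way to use m imputations is not to merge them into one "most probable" data set, but to fit the same model on each imputed data set and then pool the m estimates, typically with Rubin's rules. Below is a minimal sketch of that pooling step in plain Python. The function name `pool_rubin` and the example numbers are made up for illustration, and this is not the packages' own code.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules.

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and one variance per estimate")
    q_bar = statistics.mean(estimates)       # pooled point estimate
    u_bar = statistics.mean(variances)       # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b          # total variance of the pooled estimate
    return q_bar, total

# Hypothetical regression coefficient estimated on m = 5 imputed data sets
coefs = [1.0, 1.2, 0.8, 1.1, 0.9]
squared_ses = [0.04, 0.04, 0.04, 0.04, 0.04]
estimate, total_var = pool_rubin(coefs, squared_ses)
print(estimate, total_var ** 0.5)  # pooled coefficient and its standard error
```

The between-imputation term is what carries the extra uncertainty due to the missing data. Collapsing the imputations into a single data set before the analysis would drop that term and understate the standard errors.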

NOTE: Keep in mind that Amelia, in addition to the traditional MAR assumption, also assumes that the data you are trying to process are multivariate normal. If that is not the case, consider other options, such as mice or the corresponding Hmisc functionality.

References

Horton, N. J., & Kleinman, K. P. (2007). Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. The American Statistician, 61(1), 79–90. doi:10.1198/000313007X172556. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839993
