Solved – Problems with Missing values

data-imputation, missing data

I have a data set for a predictive model (predicting the survival rate of animals with a certain acute medical condition) with 25 predictors. Around 30% of the predictors are complete, 3 predictors have 30%, 25% and 20% of their values missing, and the rest are missing at around the 5% level. Around 50% of my cases are complete. I'm new to dealing with missing values, so I have a few questions about how to handle them:

  1. What can I do with the variable with 30% missing assuming it's MAR? Is 30% too high for imputation? What kinds of metrics can I use to make the decision between removal of the predictor, listwise removal, imputation or some other options?

  2. How should I deal with a predictor with 20% – 25% missing when I have reason to believe that it's MNAR?

  3. I'm thinking of using imputation on the remaining predictors with around 5% missingness. How do I make the decision on which imputation methods to use? Are they chosen on a case by case basis based on the individual predictors? How does one impute categorical values?

  4. How is (or should) the imputation be carried out in practice? Should it use only the complete cases, or be computed iteratively somehow?

  5. Does feature selection come before or after dealing with missing values, if at all?

Best Answer

It's generally unwise to throw away information, which is what you do with a complete-case analysis or by throwing out predictors.

One of the advantages of multiple imputation instead of a single imputation of missing data is that the result incorporates the variability introduced by the imputation process while in principle using all the available information. Coefficients associated with the variable having 30% missing values thus may have larger standard errors than coefficients from variables with few missing values, but there is no a priori reason to omit such a variable. It might be worse to omit such a variable, as information in the cases having values for that variable might improve the imputations for other variables. Even if for some reason you don't keep it as a predictor variable, it can still be included as part of the imputation process.
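To make concrete how multiple imputation carries the imputation uncertainty into the standard errors, here is a minimal sketch of Rubin's rules, the usual formulas for combining one coefficient across imputed data sets. It is written in Python purely for illustration; the function name `pool_rubin` is hypothetical and the estimates and standard errors are invented, not results from any real analysis.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)                      # pooled point estimate
    within = mean(se ** 2 for se in std_errors)   # average sampling variance
    between = variance(estimates)                 # spread across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between        # total variance of the pooled estimate
    return pooled, sqrt(total)


# Invented estimates and standard errors from m = 5 imputed data sets.
estimates = [0.42, 0.51, 0.38, 0.47, 0.45]
std_errors = [0.10, 0.11, 0.10, 0.12, 0.10]
pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
```

The more a variable's values are missing, the more the imputed data sets tend to disagree about its coefficient, so the between-imputation term grows and the pooled standard error widens. A single imputation has no such term, which is why it understates uncertainty.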

The link above provides a simple introduction to the process of generating and using the multiple sets of imputations. You draw the imputations from a probability distribution, perform your regressions on each of the imputation sets, and then pool the results among the sets. With this number of predictors it might be best to do the imputations first and then do feature selection if feature selection is really necessary. With only 25 predictors you might be better off doing a ridge regression that keeps all the predictors, with appropriate penalization, and tends to treat collinear predictors together.
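As a rough illustration of what "treating collinear predictors together" means, the sketch below fits two nearly identical, centered predictors with and without a ridge penalty, using the closed-form solution of (X'X + penalty * I) beta = X'y. It uses a linear model and invented toy data for simplicity; a survival outcome would normally call for a penalized logistic or similar model, and `ridge_two_predictors` is a hypothetical helper, not a library function.

```python
def ridge_two_predictors(x1, x2, y, penalty):
    """Closed-form ridge fit for two centered predictors and no intercept."""
    # Entries of X'X + penalty * I and of X'y.
    a = sum(u * u for u in x1) + penalty
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + penalty
    p = sum(u * t for u, t in zip(x1, y))
    q = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    # Solve the 2x2 linear system by Cramer's rule.
    return (d * p - b * q) / det, (a * q - b * p) / det


# Invented, centered toy data: x2 is almost a copy of x1.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.1, -0.9, 0.1, 0.9, 2.0]
y = [-2.2, -0.8, 0.1, 1.0, 1.9]

ols = ridge_two_predictors(x1, x2, y, penalty=0.0)    # unpenalized least squares
ridge = ridge_two_predictors(x1, x2, y, penalty=1.0)  # modest ridge penalty
```

On these data the unpenalized fit gives the two near-duplicate predictors very different coefficients, one of them negative, while the penalized fit gives them similar positive coefficients. In a real analysis the penalty would be chosen by cross-validation rather than fixed at 1.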

The mice package in R provides the tools that you need. The chained-equation approach makes it straightforward to deal with imputations of several variables at a time. You should devote some effort to setting up the structure of the imputations in a way that makes sense based on your understanding of the subject matter.
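To give a feel for the chained-equation idea, here is a deliberately simplified Python sketch for two numeric variables: each variable with missing values is regressed on the other, its missing entries are replaced by draws from the fitted model, and the cycle repeats. This is an illustration of the general idea only, not how mice is implemented (among other things, mice also accounts for uncertainty in the regression parameters and offers methods for categorical variables). The variable names and data are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for ys ~ xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd


def chained_imputation(a, b, n_iter=10, seed=0):
    """Return one completed copy of (a, b); None marks a missing value."""
    rng = random.Random(seed)
    miss_a = [v is None for v in a]
    miss_b = [v is None for v in b]
    # Start from mean imputation, then refine one variable at a time.
    fill_a = mean(v for v in a if v is not None)
    fill_b = mean(v for v in b if v is not None)
    a = [fill_a if m else v for v, m in zip(a, miss_a)]
    b = [fill_b if m else v for v, m in zip(b, miss_b)]
    for _ in range(n_iter):
        for target, other, miss in ((a, b, miss_a), (b, a, miss_b)):
            # Fit on rows where the target was actually observed.
            obs = [i for i, m in enumerate(miss) if not m]
            icpt, slope, sd = fit_line([other[i] for i in obs],
                                       [target[i] for i in obs])
            for i, m in enumerate(miss):
                if m:
                    # Draw from the predictive distribution, not just its mean.
                    target[i] = icpt + slope * other[i] + rng.gauss(0, sd)
    return a, b


# Hypothetical measurements; None marks a missing value.
weight = [10.0, 12.0, None, 15.0, 11.0, 14.0, None, 13.0]
heart_rate = [80.0, 86.0, 90.0, None, 84.0, 93.0, 82.0, 88.0]

# Different seeds give different completed data sets, as multiple imputation requires.
completed_sets = [chained_imputation(weight, heart_rate, seed=s) for s in range(5)]
```

Note that the fits use all rows in which the target variable is observed, with the other variable's current imputations filled in, rather than only the complete cases.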

Two warnings. First, if one of your predictors is really "missing not at random" (MNAR) in the technical sense, then you will need to use special care and develop a joint model of the outcome variable and the predictor. It's possible, however, to think that data are MNAR when they really might be MAR, as this question illustrates. MAR only requires "given the observed data, [missingness] does not depend on the unobserved data". So consider carefully whether your predictor really threatens to be MNAR.
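The distinction can be illustrated with a small simulation. In the Python sketch below, a variable is made missing in two ways: with a probability that depends only on a fully observed variable (MAR), and with a probability that depends on its own unobserved value (MNAR). The variable names `age` and `lactate`, the missingness probabilities, and all numbers are invented for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)
n = 5000

# Hypothetical variables: age is always observed, lactate is sometimes missing.
age = [rng.uniform(1, 15) for _ in range(n)]
lactate = [2.0 + 0.2 * a + rng.gauss(0, 1) for a in age]
full = list(zip(age, lactate))

# MAR: the chance of being missing depends only on the observed age.
mar_observed = [(a, l) for a, l in full
                if rng.random() > (0.5 if a > 8 else 0.1)]

# MNAR: the chance of being missing depends on the lactate value itself.
mnar_observed = [(a, l) for a, l in full
                 if rng.random() > (0.5 if l > 3.6 else 0.1)]


def mean_in_age_band(pairs, low=9.0, high=11.0):
    """Mean lactate among cases whose age falls in [low, high)."""
    return mean(l for a, l in pairs if low <= a < high)


full_mean = mean_in_age_band(full)           # uses values that would be unobserved in practice
mar_mean = mean_in_age_band(mar_observed)    # close to full_mean
mnar_mean = mean_in_age_band(mnar_observed)  # biased downward
```

Under MAR, once you condition on the observed variable (here, a narrow age band), the observed values are representative of the missing ones, which is what imputation relies on. Under MNAR the observed values remain biased even after conditioning, so no model built from the observed data alone can fully correct for it.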

Second, think about how you will use this model for prediction. If some predictors are likely to be missing in many future cases, not just in your present data set, and you will be making predictions case by case, then consider carefully how you would make predictions in such cases and whether those predictors belong in your model.