Solved – How to use the data after missing values imputation (using mice)

mice, missing-data, multiple-imputation, r

This question is related to how to get a complete data set from one containing missing values, and how to impute new cases.

The mice R package imputes missing values. Its algorithm produces M completed data sets in which all observed values remain the same, while the missing values are replaced with values drawn from a set of conditional densities based on certain distributions.

My question is: if I need to build a predictive model, is it valid to create the training data by averaging all values across the M data sets, so that the observed values stay the same and the imputed values are averaged? That works for numeric variables; for categorical variables the mode can be used.

Another approach would be to append all cases into one "big" data set and train the model on that.

Finally, once the model is running live in production, new cases would be imputed using the same criteria.

Does this make sense?

The mice paper can be found at: https://www.jstatsoft.org/index.php/jss/article/view/v045i03/v45i03.pdf

library(mice)

# run the default multiple imputation on the nhanes data frame (ships with mice), producing 5 imputed data frames
imp <- mice(nhanes, m = 5)

# stack the 5 completed data frames into one; total rows = nrow(nhanes) * 5
data_all <- complete(imp, "long")

# data_all contains the same columns as nhanes plus 2 more: '.id' and '.imp'
# .id=row number, from 1 to 25
# .imp=imputation data frame id, 1 to 5 ('m' parameter)

Grouping can then be done with the .id and .imp variables, or the stacked data can be used as is (data_all).

Best Answer

You should fit the model on each of the imputed data sets separately and then pool the resulting regression estimates. See the article you linked, "mice: Multivariate Imputation by Chained Equations in R" by van Buuren & Groothuis-Oudshoorn (2011), especially section 5 and point 5.3 in particular. You can then apply the pooled model to the training data to test it.
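The fit-then-pool workflow can be sketched as follows. This is a minimal illustration, assuming a linear model of chl on age and bmi in the nhanes data; the formula and the seed are arbitrary choices made for the example, not something the question specifies.

```r
library(mice)

# nhanes ships with mice: 25 rows, columns age, bmi, hyp, chl
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Fit the same model on each of the 5 completed data sets
# (the formula is an illustrative assumption)
fit <- with(imp, lm(chl ~ age + bmi))

# Combine the 5 sets of estimates using Rubin's rules
pooled <- pool(fit)

# One row per coefficient, with pooled estimates and standard errors
summary(pooled)
```

The pooled standard errors account for both the within-imputation variance and the variance between imputations, which is the part that averaging or stacking the data discards.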

The problem with averaging the imputed data is that you discard some of the variability in the data. The problem with the second proposal is that you artificially inflate the sample size, which makes the error estimates for predictions based on the model useless.
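To see the sample-size inflation concretely, one can compare a model fitted naively to the stacked data with the properly pooled fit. This is only a sketch: the model formula and seed are assumptions made for illustration, and the exact numbers depend on the imputations drawn.

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Stack the 5 completed data sets: 125 rows built from only 25 real cases
stacked <- complete(imp, "long")
c(original = nrow(nhanes), stacked = nrow(stacked))

# Naive fit that treats every stacked row as an independent observation
naive <- lm(chl ~ age + bmi, data = stacked)

# Proper analysis: fit per imputation, then pool
pooled <- pool(with(imp, lm(chl ~ age + bmi)))

# Put the two sets of standard errors side by side
cbind(
  naive  = summary(naive)$coefficients[, "Std. Error"],
  pooled = summary(pooled)$std.error
)
```

The naive fit works with 122 residual degrees of freedom even though only 25 cases were observed, so its standard errors would be expected to come out too small.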

EDIT: The regression model fitted to the imputed data should be nested within the regression model used to impute the data. This, together with the pooling methods provided, implies that mice was designed with regression analysis in mind. But there is no reason not to use other methods on imputed data.

The problem is how to generalize the models obtained from many imputed data sets, and this step depends on the method used. In the case of tree-based models you can simply append all the imputed data into one big data frame (as you proposed). The resulting tree model will in essence be the average pooled tree model.
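For the tree-based case, a rough sketch using rpart (chosen here only as an example of a tree learner; the answer does not name one) could look like this. It shows the stacked-data fit described above and, for comparison, an alternative that fits one tree per imputed data set and averages the predictions. The formula and the new_cases values are hypothetical.

```r
library(mice)
library(rpart)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Option 1: a single tree trained on all imputed data sets stacked together
stacked <- complete(imp, "long")
tree <- rpart(chl ~ age + bmi + hyp, data = stacked, method = "anova")

# Option 2: one tree per imputed data set, predictions averaged afterwards
trees <- lapply(seq_len(imp$m), function(i) {
  rpart(chl ~ age + bmi + hyp, data = complete(imp, i), method = "anova")
})

# Hypothetical complete new cases to score
new_cases <- data.frame(age = c(1, 3), bmi = c(22, 28), hyp = c(1, 2))

pred_stacked <- predict(tree, newdata = new_cases)
pred_avg <- rowMeans(sapply(trees, predict, newdata = new_cases))
```

Both options give point predictions only; neither provides the kind of pooled error estimate that Rubin's rules give for regression coefficients.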