Solved – Cross Validation and Multiple Imputation for Missing Data

cross-validation, missing data, model-evaluation, multiple-imputation, predictive-models

When using 10-fold cross-validation (CV) to estimate the performance of a logistic regression model, what is the appropriate way to incorporate multiple imputation when there is missingness in both the predictors and the outcome, assuming the mechanism is missing at random (MAR)? Also, should the outcome be included in the imputation models, so that the predictors can be used to impute the outcome and vice versa, or should the outcome be left out of the imputation models?

This is what I am thinking:
Using the training data only (90%):

1. Perform multiple imputation, creating 10 imputed training datasets.
2. Fit a logistic regression model to each imputed dataset.
3. Average the model coefficients across the 10 imputed datasets to obtain a single logistic regression model with pooled coefficients.

Using the test data only (10%):

4. Perform multiple imputation, creating 10 imputed test datasets.
5. Apply the pooled model from step 3 to each of the 10 imputed test datasets.
6. Average the error across the 10 imputed test datasets to obtain the average error of the pooled model.

Repeat this process for each of the 10 folds, so that each fold yields a single pooled model (derived from the multiply imputed training data) and a single average error (from applying that model to the multiply imputed test data). Finally, average the 10 fold-level errors to obtain the overall performance estimate. A code sketch of this procedure appears below.
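For concreteness, here is a minimal Python sketch of that fold-level procedure. It uses scikit-learn's `IterativeImputer` with `sample_posterior=True` as a stand-in for a full multiple-imputation routine (e.g., MICE), simple misclassification error as an illustrative loss, and toy data. For brevity it assumes missingness in the predictors only (a complete outcome); handling an incomplete outcome would require imputing it as well. All names and data here are illustrative, not part of the original question.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Toy data: n rows, p predictors, with ~10% of predictor values set missing.
n, p, M = 500, 4, 10
X = rng.normal(size=(n, p))
y = (X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(size=n) > 0).astype(int)
X[rng.random((n, p)) < 0.10] = np.nan

fold_errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Steps 1-2: impute the training fold M times; fit a model on each
    # completed dataset.
    coefs, intercepts = [], []
    for m in range(M):
        imp = IterativeImputer(sample_posterior=True, random_state=m)
        X_tr_m = imp.fit_transform(X_tr)
        fit = LogisticRegression().fit(X_tr_m, y_tr)
        coefs.append(fit.coef_.ravel())
        intercepts.append(fit.intercept_[0])

    # Step 3: pool by averaging coefficients across the M imputed training sets.
    beta = np.mean(coefs, axis=0)
    beta0 = np.mean(intercepts)

    # Steps 4-6: impute the test fold M times (separately from the training
    # fold) and average the misclassification error of the pooled model.
    errs = []
    for m in range(M):
        imp = IterativeImputer(sample_posterior=True, random_state=100 + m)
        X_te_m = imp.fit_transform(X_te)
        p_hat = 1.0 / (1.0 + np.exp(-(X_te_m @ beta + beta0)))
        errs.append(np.mean((p_hat >= 0.5) != y_te))
    fold_errors.append(np.mean(errs))

# Final estimate: average the 10 fold-level average errors.
print(f"CV estimate of misclassification error: {np.mean(fold_errors):.3f}")
```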

Best Answer

I believe that your thinking is right.

The alternative is to perform multiple imputation on the entire dataset prior to splitting it into train/test partitions. Doing so would mean that information from the training sets is used to impute values in the test sets (and vice versa). In other words, there would be leakage across the train/test boundary, biasing the results of cross-validation, typically in an optimistic direction.
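Continuing with the variables from the sketch above, the leaky variant would look like the following; it is shown only to illustrate the pattern to avoid:

```python
# Leaky pattern (avoid): impute once on the full dataset, then split.
# The imputation model has already seen every row, so each fold's held-out
# values were filled in using information from the other folds.
imp_all = IterativeImputer(sample_posterior=True, random_state=0)
X_leaky = imp_all.fit_transform(X)  # fit on all n rows at once

for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X_leaky):
    ...  # any model fit and evaluated here inherits the leakage
```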
