Multiple Imputation for Predictors Only, Excluding Missing Outcome Data

elastic net, mice, multiple-imputation

I am working with a dataset containing ~300 predictors and ~3000 observations and building a predictive model using elastic net (and hoping to generalize to an external validation set). While the majority of observations are complete cases, there are some observations with missing values in either some of the predictors, the outcome, or both. My current approach is to remove all observations with missing outcome data from the analysis, and use mice (in R) to perform multiple imputation for missing values of the predictors. To me, this was the approach that made sense intuitively, as I was concerned about reporting performance metrics on observations that did not have observed values of the outcome.

However, I have seen that it may be valid to include observations with missing outcome data in the dataset, and let the outcome values also be handled through multiple imputation. I was curious about the conditions in which one method would be preferred over the other, if any. My suspicion is that it may be better to re-do these analyses while also imputing missing outcome data and missing predictor data, rather than using this "complete outcome" approach. Any insight is appreciated, and I'm happy to provide more information if needed!

Best Answer

From your description, you might be better off doing imputation on all your observations. There is no need to remove cases with missing outcome values, as analysis of properly performed multiply imputed data sets will incorporate the uncertainty from imputing the outcome values. Stef van Buuren's Flexible Imputation of Missing Data (FIMD) book certainly advocates imputing missing outcomes.

How much of a difference that will make depends on details of your data, whether missingness depends on outcome values, and whether your complete-data model is correct.
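How imputation uncertainty enters the final inference can be sketched with Rubin's rules, the standard pooling step for multiply imputed analyses (in R, `mice::pool()` performs this pooling). The function name `pool_rubin` and the numbers below are made up for illustration and do not come from your data. This is a minimal sketch for a single scalar estimate, not a substitute for the package's implementation.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m imputed data sets (Rubin's rules).

    estimates: the point estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed data sets (illustrative numbers only).
pooled_estimate, total_variance = pool_rubin(
    estimates=[1.0, 1.2, 0.8],
    variances=[0.04, 0.04, 0.04],
)
print(pooled_estimate, total_variance)
```

The between-imputation term is what carries the extra uncertainty: the more the imputed values (including imputed outcomes) disagree across data sets, the larger the total variance becomes relative to the within-imputation variance alone.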

Depending on your situation, even complete-case analysis might be OK. Stef van Buuren outlines some such special cases, in particular (FIMD, Section 2.7):

"The first special case occurs if the probability to be missing does not depend on [outcome] $Y$. Under the assumption that the complete-data model is correct, the regression coefficients [of complete-case analysis] are free of bias. This holds for any type of regression analysis, and for missing data in both $Y$ and $X$. Since the missing data rate may depend on $X$, complete-case analysis will in fact work in a relevant class of MNAR models."

That result requires both that missingness does not depend on $Y$ and that the complete-data model is correct. Nevertheless,

"Multiple imputation gains an advantage over complete-case analysis if additional predictors for $Y$ are available that are not part of [the complete-case model predictors] $X$."

That would seem to be your situation, as the predictor selection in elastic net means that there are potential predictors of $Y$ that will not be in the final set of predictors in the ultimate model.
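The first special case quoted above can be checked with a small simulation. The sketch below uses only the Python standard library and a deliberately simple setup that is an assumption for illustration, not your data: one predictor, a correctly specified linear model with true slope 2, and two hypothetical missingness mechanisms in which a row is incomplete whenever $X > 0$ or whenever $Y > 0$. The function names are likewise made up.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def complete_case_slopes(n=20000, beta=2.0, seed=1):
    """Compare complete-case slopes under two missingness mechanisms."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [beta * x + rng.gauss(0, 1) for x in xs]  # correct linear model

    # Mechanism A (hypothetical): a row is incomplete whenever x > 0,
    # so missingness depends on X only.
    rows_a = [(x, y) for x, y in zip(xs, ys) if x <= 0]
    # Mechanism B (hypothetical): a row is incomplete whenever y > 0,
    # so missingness depends on the outcome Y.
    rows_b = [(x, y) for x, y in zip(xs, ys) if y <= 0]

    slope_full = ols_slope(xs, ys)
    slope_a = ols_slope(*zip(*rows_a))
    slope_b = ols_slope(*zip(*rows_b))
    return slope_full, slope_a, slope_b


slope_full, slope_x_dep, slope_y_dep = complete_case_slopes()
print("all rows:                     ", round(slope_full, 2))
print("complete cases, depends on X: ", round(slope_x_dep, 2))
print("complete cases, depends on Y: ", round(slope_y_dep, 2))
```

Under these assumptions the complete-case slope stays close to the true value when missingness depends only on $X$, and is pulled toward zero when missingness depends on $Y$. The simulation says nothing about efficiency, which is where the additional predictors mentioned in the second quote come in.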
