Multiple Imputation – Using Multiple Imputation for Outcome Variables

meta-analysis, meta-regression, missing-data, multiple-imputation

I've got a dataset on agricultural trials. My response variable is a response ratio: log(treatment/control). I'm interested in what mediates the difference, so I'm running RE meta-regressions (unweighted, because it seems pretty clear that effect size is uncorrelated with the variance of the estimates).

Each study reports grain yield, biomass yield, or both. I can't impute grain yield from studies that report biomass yield alone, because not all of the plants studied were useful for grain (sugar cane is included, for instance). But each plant that produced grain also had biomass.

For missing covariates, I've been using iterative regression imputation (following Andrew Gelman's textbook chapter). It seems to give reasonable results, and the whole process is generally intuitive. Basically, I predict each variable's missing values from the other variables, use those filled-in values to predict the remaining missing values, and loop through the variables until each one approximately converges (in distribution).
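A minimal numpy sketch of that loop, under my own simplifying assumptions (all-numeric covariates, plain linear models for each variable, mean-fill to initialize, and a stochastic refill with residual noise; in practice you would monitor the imputed distributions for convergence rather than fix the iteration count):

```python
import numpy as np

rng = np.random.default_rng(0)

def iterative_impute(X, n_iter=20):
    """Iterative regression imputation: repeatedly regress each column
    that has missing values on all the other columns, then refill the
    holes with prediction + residual noise."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # initialize every hole with its column mean
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])  # design matrix
            obs = ~miss[:, j]
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid_sd = np.std(X[obs, j] - A[obs] @ beta)
            # stochastic refill: linear prediction plus residual noise,
            # so the imputed values keep roughly the right spread
            X[miss[:, j], j] = (A[miss[:, j]] @ beta
                                + rng.normal(0, resid_sd, miss[:, j].sum()))
    return X
```

Running this m times with different seeds gives m completed datasets for the usual multiple-imputation analysis.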

Is there any reason why I can't use the same process to impute missing outcome data? I can probably form a relatively informative imputation model for biomass response ratio given grain response ratio, crop type, and other covariates that I have. I'd then average the coefficients, pool the VCVs, and add the MI correction as per standard practice.
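The "average and add the MI correction" step is Rubin's rules; a sketch of the pooling (the function name and array shapes are my own, not from the post):

```python
import numpy as np

def pool_rubin(coefs, vcovs):
    """Pool m sets of regression coefficients and their VCVs
    across imputed datasets using Rubin's rules."""
    coefs = np.asarray(coefs)   # shape (m, p): one coefficient vector per imputation
    vcovs = np.asarray(vcovs)   # shape (m, p, p): one VCV per imputation
    m = coefs.shape[0]
    qbar = coefs.mean(axis=0)               # pooled point estimates
    ubar = vcovs.mean(axis=0)               # within-imputation variance
    diffs = coefs - qbar
    b = diffs.T @ diffs / (m - 1)           # between-imputation variance
    total = ubar + (1 + 1 / m) * b          # total variance, Rubin (1987)
    return qbar, total
```

The (1 + 1/m) factor is the finite-m correction; with identical results across imputations the between-imputation term vanishes and you recover the ordinary VCV.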

But what do these coefficients measure when the outcomes themselves are imputed? Is the interpretation of the coefficients any different from standard MI for covariates? Thinking about it, I can't convince myself that this doesn't work, but I'm not really sure. Thoughts and suggestions for reading material are welcome.

Best Answer

As you suspected, it is valid to use multiple imputation for the outcome measure. There are cases where this is useful, but it can also be risky. I consider the situation where all covariates are complete, and the outcome is incomplete.

If the imputation model is correct, we will obtain valid inferences on the parameter estimates from the imputed data. The inferences obtained from just the complete cases may actually be wrong if the missingness is related to the outcome after conditioning on the predictors, i.e. under MNAR. So imputation is useful if we know (or suspect) that the data are MNAR.

Under MAR, there is generally no benefit to imputing the outcome, and for a low number of imputations the results may even be somewhat more variable because of simulation error. There is an important exception to this. If we have access to an auxiliary complete variable that is not part of the model and that is highly correlated with the outcome, imputation can be considerably more efficient than complete case analysis, resulting in more precise estimates and shorter confidence intervals. A common scenario where this occurs is if we have a cheap outcome measure for everyone, and an expensive measure for a subset.
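A toy simulation of that exception, entirely my own construction: the outcome is missing for about half the units (MCAR here, for simplicity), but a fully observed auxiliary variable is highly correlated with it. Comparing the sampling spread of the complete-case mean against an estimate that fills the holes with regression predictions from the auxiliary variable shows the efficiency gain. (Deterministic regression fill understates uncertainty for inference; it is used here only to compare the variability of point estimates.)

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate(y, z, use_aux):
    obs = ~np.isnan(y)
    if not use_aux:
        return y[obs].mean()            # complete-case mean
    # regress observed y on the fully observed auxiliary z,
    # fill the missing y's with predictions, then average
    beta = np.polyfit(z[obs], y[obs], 1)
    y_fill = y.copy()
    y_fill[~obs] = np.polyval(beta, z[~obs])
    return y_fill.mean()

cc, aux = [], []
for _ in range(500):
    z = rng.normal(size=200)
    y = z + rng.normal(0, 0.3, 200)     # z is highly correlated with y
    y[rng.random(200) < 0.5] = np.nan   # ~50% of outcomes missing (MCAR)
    cc.append(estimate(y, z, use_aux=False))
    aux.append(estimate(y, z, use_aux=True))

# the auxiliary-variable estimator has a visibly smaller spread
print(np.std(cc), np.std(aux))
```

With a weakly correlated auxiliary variable the gap shrinks toward zero, which is the "generally no benefit under MAR" case above.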

In many data sets, missing data also occur in the independent variables. In these cases, we need to impute the outcome variable since its imputed version is needed to impute the independent variables.
