Including dependent variables in multiple imputation model when they have missing values

missing data, multiple-imputation

I have a data set which has missing values on several columns. The analysis I am doing involves regressions where several of the variables are used as dependent variables, and others as explanatory variables.

For multiple imputation, the advice that I have read (e.g. Wulff, J. N., & Ejlskov, L. (2017). Multiple Imputation by Chained Equations in Praxis: Guidelines and Review) suggests that the DVs should be included in the imputation model. So I created a model where the DVs are included as predictors, but not imputed.

However, because the DVs themselves have missing values, the imputed explanatory variables still end up with some missing values. I assume this is because the DVs are used as predictors in the imputation model but are not imputed themselves. If I remove the DVs from the imputation model, I get (almost) no missing values for the imputed explanatory variables.

What is the correct way to handle this situation? Should I remove the DVs from the imputation model (contra Wulff & Ejlskov, and others)? Or should I impute the DVs as well, and if so, should I use the imputed DV values in the regressions?

Here is some R code to illustrate the problem:

library(mice)
library(missForest)

set.seed(123)

# Create a dataset with missing values on 2 columns
iris.mis = prodNA(iris[, c('Sepal.Length', 'Petal.Length')], 0.20)
iris.mis = cbind(iris.mis, iris[, c('Sepal.Width', 'Petal.Width', 'Species')])

# Setup MICE
init = mice(iris.mis, maxit = 0)
iris.method = init$method
iris.method['Sepal.Length'] = "" # Do not impute Sepal.Length

# Run MICE
imp = mice(iris.mis, m = 5, maxit = 5, method = iris.method)

# Inspect Result
res = mice::complete(imp, 1)
print(sum(is.na(res$Petal.Length)))

# [1] 8
# i.e. still 8 missing values in Petal.Length, when we wanted 0

Best Answer

See Kontopantelis et al. (2017), who describe the proper way to handle this situation. You should definitely retain the DV in the imputation model and use it to impute the predictors, and you should in turn use the predictors to impute the missing values of the DV. What the paper demonstrates is that it doesn't really matter whether the analysis then retains the individuals whose DV was originally missing or discards them. To me, it's preferable to retain them to keep your sample size larger.
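
The advice above could look roughly like the following in R. This is only a minimal sketch under a few assumptions: the missing values are created with base R's sample() rather than prodNA(), the analysis model is an arbitrary illustrative regression, and dv.observed is a helper name introduced here, not something from the question or the paper. The default methods are left untouched so that mice imputes every incomplete column, including the DV.

```r
library(mice)

set.seed(123)

# Illustrative data: 30 of 150 values missing in the DV and in one predictor.
iris.mis <- iris
iris.mis$Sepal.Length[sample(nrow(iris), 30)] <- NA
iris.mis$Petal.Length[sample(nrow(iris), 30)] <- NA

# Hypothetical helper: TRUE for rows whose DV was observed before imputation.
dv.observed <- !is.na(iris.mis$Sepal.Length)

# Keep the default methods so the DV (Sepal.Length) is imputed too,
# which lets it serve as a complete predictor for Petal.Length.
imp <- mice(iris.mis, m = 5, maxit = 5, printFlag = FALSE)

# Check the first completed data set for remaining missing values.
res <- complete(imp, 1)
print(colSums(is.na(res)))

# Option A: fit the analysis model on all rows, then pool across imputations.
fit.all <- with(imp, lm(Sepal.Length ~ Petal.Length + Sepal.Width))
print(summary(pool(fit.all)))

# Option B: fit only on rows whose DV was originally observed, then pool.
fits.obs <- lapply(seq_len(imp$m), function(i) {
  d <- complete(imp, i)
  lm(Sepal.Length ~ Petal.Length + Sepal.Width, data = d[dv.observed, ])
})
print(summary(pool(as.mira(fits.obs))))
```

Option A keeps every row, which matches the preference stated above. Option B drops the rows with an imputed DV after imputation, so the two sets of pooled estimates can be compared directly on your own data.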


Kontopantelis, E., White, I. R., Sperrin, M., & Buchan, I. (2017). Outcome-sensitive multiple imputation: A simulation study. BMC Medical Research Methodology, 17(1). https://doi.org/10.1186/s12874-016-0281-5
