Solved – How to choose which imputation to use to replace missing values

mice, multiple-imputation

I am a psychologist (i.e. not a statistician or mathematician) and wish to replace missing values in my dataset. I have followed the steps here and they seem straightforward. But I really don't know how to decide upon the replacement values for the missing values to enter into my dataset. I have looked on these threads and some answers have come close but none exactly explain what I need to know.

The toy data provided here is very similar in structure to my own: blood pressure measurements taken at three time points; first is a baseline, second after abstinence, third after treatment.

set.seed(2345)
systolic1 <- rnorm(20, 120, 5)
diastolic1 <- rnorm(20, 90, 3)
systolic2 <- rnorm(20, 125, 5)
diastolic2 <- rnorm(20, 94, 3)
systolic3 <- rnorm(20, 120, 5)
diastolic3 <- rnorm(20, 90, 3)
df <- data.frame(systolic1, diastolic1, systolic2, diastolic2, systolic3, diastolic3)
df[c(1,4), 1:2] <- NA
df[1, 3:4] <- NA
df[15, 5:6] <- NA

So I run the mice package on the data.

library(mice)  # provides mice(), complete(), with() for imputed data, and pool()
tempdf <- mice(df, m=5, maxit=5, meth='pmm', seed=500)

This will yield five imputed values for each missing value, e.g.

tempdf$imp$systolic1

If I choose one of the imputations I can even obtain a complete dataset with missing values replaced.

completeData <- complete(tempdf, 3)

But how do I know which of the five imputations to choose (I chose number three here at random)? The link above and most of the relevant Cross Validated threads suggest fitting a regression model for each variable with missing data. I can do this with:

modelFit1 <- with(tempdf, lm(systolic1 ~ diastolic1 + systolic2 + diastolic2 + systolic3 + diastolic3))

summary(pool(modelFit1))

This yields a table of coefficients. But what do the coefficients mean? How do they help me choose which of the imputed values above to choose? I'm sorry if this seems obvious to everyone, but I would greatly appreciate a simple explanation.

Best Answer

With mice, the goal is NOT to choose the best imputation (in your case, one of the 5 you generated above) to replace the NA values in your variable.
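To illustrate why no single imputation is picked, here is a minimal sketch that rebuilds the toy data from the question, fits the same regression on each of the 5 completed datasets, and prints the coefficients side by side. The object names `imp`, `fit`, `per_imputation` and `manual_pooled` are mine, and `printFlag = FALSE` is only there to silence the iteration log. The exact layout of the pooled summary may differ between mice versions.

```r
library(mice)

# Toy data from the question, repeated so this block runs on its own
set.seed(2345)
systolic1 <- rnorm(20, 120, 5)
diastolic1 <- rnorm(20, 90, 3)
systolic2 <- rnorm(20, 125, 5)
diastolic2 <- rnorm(20, 94, 3)
systolic3 <- rnorm(20, 120, 5)
diastolic3 <- rnorm(20, 90, 3)
df <- data.frame(systolic1, diastolic1, systolic2, diastolic2, systolic3, diastolic3)
df[c(1, 4), 1:2] <- NA
df[1, 3:4] <- NA
df[15, 5:6] <- NA

# Create m = 5 imputed datasets with predictive mean matching
imp <- mice(df, m = 5, maxit = 5, method = "pmm", seed = 500, printFlag = FALSE)

# Fit the same model once per imputed dataset
fit <- with(imp, lm(systolic1 ~ diastolic1 + systolic2 + diastolic2 + systolic3 + diastolic3))

# One row per imputed dataset, one column per coefficient
per_imputation <- t(sapply(fit$analyses, coef))
print(round(per_imputation, 3))

# Averaging over the rows gives the pooled point estimates
manual_pooled <- colMeans(per_imputation)
print(round(manual_pooled, 3))

# pool() reports the same point estimates plus pooled standard errors
print(summary(pool(fit)))
```

Each row of `per_imputation` is a slightly different answer to the same question. The spread between rows is information about how uncertain the imputed values are, which is lost if you keep only one of the five datasets.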

Rather, you choose an appropriate number of imputations and iterations, run your analysis on every imputed dataset, and then obtain a pooled value. That pooled value could be, for example, a pooled coefficient from a regression model fitted to all m imputed datasets, or even the average of the m imputed values used to replace each NA in your variable.
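As a rough illustration of what "pooled" means, the sketch below applies Rubin's rules (the combining rules that `pool()` uses) by hand to a single coefficient in base R. The numbers are made up for illustration and are not output from the data above. The variable names are my own.

```r
# Hypothetical results for ONE regression coefficient across m = 5 imputations
# (made-up numbers, not taken from the question's data)
estimates  <- c(0.52, 0.47, 0.55, 0.50, 0.46)  # coefficient from each fit
std_errors <- c(0.10, 0.11, 0.09, 0.10, 0.12)  # its standard error in each fit

m <- length(estimates)

# Pooled point estimate: the mean of the m estimates
pooled_estimate <- mean(estimates)

# Within-imputation variance: the mean of the squared standard errors
within_var <- mean(std_errors^2)

# Between-imputation variance: how much the estimates differ across imputations
between_var <- var(estimates)

# Total variance: within + between, with a correction for using a finite m
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

print(c(estimate = pooled_estimate, std.error = pooled_se))
```

The pooled standard error is larger than the typical within-imputation standard error because it also carries the disagreement between imputations. This is the reason for pooling instead of analysing one completed dataset as if it had been fully observed.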

I think the links below will give you the code you need and an understanding of the steps:

Code: http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/mi.html

Detailed PDF: https://www.jstatsoft.org/article/view/v045i03