Solved – Compare the output of a pooled model after multiple imputation vs model on combined long dataset

mice, missing-data, multiple-imputation, r

My question is about differences in approaches to analysing data generated with Multiple-Imputation via Chained Equations.

I am using the R package mice.

Broadly, my query concerns the following situation:

  1. Take a dataset X with missing values in some variables 1, 2, 3.
  2. Apply Multiple Imputation via Chained Equations.
  3. Generate Z completed datasets.

Impute Data
library(mice)  # needed for mice(), complete(), with() and pool()
imputed <- mice(data = heartattack, m = 20, maxit = 50, seed = 100)

  4. Compare two different approaches to analysing that data

My question is whether the output of the next two steps should be the same (I think not but am new to MICE so want to check):

1) Combine the multiple complete datasets into a single long dataset.
Run a model (say logistic regression) on this long-combined dataset.

# stack the m imputed datasets into one long dataset
imputed_long <- complete(imputed, "long")
# fit a single logistic regression on the stacked data
imputed_long_model <- glm(attack ~ smokes + female + hsgrad, family = binomial(), data = imputed_long)
# odds ratios with 95% CIs
imputed_long_OR <- exp(cbind(OR = coef(imputed_long_model), confint(imputed_long_model)))
imputed_long_summary <- summary(imputed_long_model)
imputed_long_summary <- cbind(imputed_long_OR, imputed_long_summary$coefficients)
imputed_long_summary

vs

2) Run the model on each imputed dataset and pool the results.

# fit the same logistic regression separately on each of the m imputed datasets
imputed_model <- with(imputed, glm(attack ~ smokes + female + hsgrad, family = binomial()))
# pool the m fits using Rubin's rules
imputed_model_summary <- summary(pool(imputed_model))
# columns 1, 6 and 7 are the pooled estimate and the 95% CI bounds in older mice
# versions; check colnames(imputed_model_summary) if your version orders them differently
imputed_model_OR <- exp(cbind(imputed_model_summary[, 1], imputed_model_summary[, 6], imputed_model_summary[, 7]))
imputed_model_summary <- cbind(imputed_model_OR, imputed_model_summary)
imputed_model_summary
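(As an aside: I believe newer releases of mice (3.x) can return the pooled odds ratios and confidence intervals more directly. A minimal sketch, assuming your version of summary.mipo supports the conf.int and exponentiate arguments:)

summary(pool(imputed_model), conf.int = TRUE, exponentiate = TRUE)  # pooled ORs with 95% CIs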

I seem to get similar point estimates for the effect sizes but different 95% CIs (tighter 95% CIs in the long-dataset model). I wondered if this is because the long-dataset model only accounts for the within-imputation variance (i.e. the uncertainty in each fitted model) but not the between-imputation variability, because by stacking the data into a single long dataset you have removed that.
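One way I thought I could check this (a sketch, assuming the mice 3.x structure where pool() stores its variance components in a pooled data frame with ubar, b and t columns):

pooled_fit <- pool(imputed_model)
# 'ubar' = average within-imputation variance, 'b' = between-imputation variance,
# 't'    = total variance used for the pooled standard errors
pooled_fit$pooled[, c("term", "estimate", "ubar", "b", "t")]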

My feeling from reading the literature is that approach 2 (run the model on each imputed dataset, then pool) is the correct one, but I would be grateful for feedback!

Please let me know if clarifications needed.

Best Answer

Short answer: your gut feeling is right.

Longer answer: The strength of multiple imputation lies in the pooling procedure. If you read the mice manual, the authors go into depth about this. They state that imputation is not a technique you apply to a dataset with missing data just to fill in the empty cells; rather, it is a combination of setting up a strategy to replace missing data (chained equations in the case of mice), performing the analysis on each completed dataset, and subsequently pooling the results to answer your research question (i.e. the reason you performed the analysis). As such, all of these steps are mandatory.

Now, more specifically to your situation. In the original data, some values might be selectively missing, which could lead to bias. Moreover, most analyses require complete data on all variables, so you would otherwise need to exclude cases or handle them in some way. Using imputation, you complete the missing data with 'guesstimates' under the assumption that your data are 'missing at random, conditional on observed data' (MAR). However, because these are guesses based on your data, you add some randomness and repeat the completion process multiple times in order to create a distribution of guesses.

If you analyse these data in the 'long' format you mention, you have basically inflated your sample size by a factor equal to the number of imputed datasets! This will undoubtedly increase the apparent precision of your estimates, but it is wrong: cases that were complete from the start have simply been copied m times and, more importantly, you have not taken into account the uncertainty of your guesstimates.
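A quick way to see this, reusing the objects from your question (a sketch):

nrow(heartattack)                # original sample size n
nrow(complete(imputed, "long"))  # the stacked data have m * n rows (here 20 * n)
# the naive glm on the stacked data treats these m copies as independent observations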

The better way, therefore, is to analyse the data per imputed dataset. This way you get m sets of results. However, you do not know which imputed dataset is the 'most correct' (if there even is such a thing), so the average coefficient across all models is your best estimate of the 'true' effect. For the precision (and hypothesis testing/confidence intervals) you then need to handle the standard error appropriately; this is where the uncertainty comes into play. Using Rubin's rules you average the squared standard errors across imputations (the within-imputation variance) and add 'a little extra' (the between-imputation variance) to represent the variation of the estimates across imputed datasets.
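To make the 'little extra' concrete, here is a minimal sketch of Rubin's rules applied by hand to one coefficient from your imputed_model object, assuming smokes enters the model as a single numeric predictor so its coefficient is named "smokes" (pool() does all of this for you):

fits <- imputed_model$analyses                      # the m per-imputation glm fits
est  <- sapply(fits, function(f) coef(f)["smokes"])
se   <- sapply(fits, function(f) sqrt(vcov(f)["smokes", "smokes"]))
m    <- length(fits)
W    <- mean(se^2)                                  # within-imputation variance
B    <- var(est)                                    # between-imputation variance
pooled_est <- mean(est)                             # pooled coefficient
pooled_se  <- sqrt(W + (1 + 1/m) * B)               # total SE: larger than sqrt(W) alone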

Conclusion

Finally, creating your confidence intervals and performing your hypothesis tests using these pooling rules usually reduces bias and biased inference compared to a complete-case analysis. Compared to your long-format dataset, the coefficients might be pretty similar, but as you and your gut feeling rightly pointed out, the long-format results are far more precise (too-narrow confidence intervals, too-small p-values) than can actually be justified by these imputation analyses.