Solved – Compare the output of a pooled model after multiple imputation vs model on combined long dataset

mice, missing-data, multiple-imputation, r

My question is about differences in approaches to analysing data generated with Multiple-Imputation via Chained Equations.

I am using the R package mice.

Broadly, my query concerns the following situation:

  1. Take a dataset X with missing values in some variables 1, 2, 3.
  2. Apply Multiple Imputation via Chained Equations.
  3. Generate Z completed datasets.

Impute Data
library(mice)  # needed for mice(), complete(), with() and pool()
imputed <- mice(data = heartattack, m = 20, maxit = 50, seed = 100)

  4. Compare two different approaches to analysing that data

My question is whether the output of the next two steps should be the same (I think not but am new to MICE so want to check):

1) Combine the multiple complete datasets into a single long dataset.
Run a model (say logistic regression) on this long-combined dataset.

# stack the m imputed datasets into one long dataset
imputed_long <- complete(imputed, "long")
# fit a single logistic regression on the stacked data
imputed_long_model <- glm(attack ~ smokes + female + hsgrad, family = binomial(), data = imputed_long)
# odds ratios with 95% CIs
imputed_long_OR <- exp(cbind(OR = coef(imputed_long_model), confint(imputed_long_model)))
imputed_long_summary <- summary(imputed_long_model)
imputed_long_summary <- cbind(imputed_long_OR, imputed_long_summary$coefficients)
imputed_long_summary

vs

2) Run the model on each imputed dataset and pool the results.

# fit the same logistic regression separately on each of the m imputed datasets
imputed_model <- with(imputed, glm(attack ~ smokes + female + hsgrad, family = binomial()))
# pool the m fits using Rubin's rules
imputed_model_summary <- summary(pool(imputed_model))
# columns 1, 6 and 7 are the pooled estimate and the 95% CI bounds in older mice
# versions; check colnames(imputed_model_summary) if your version orders them differently
imputed_model_OR <- exp(cbind(imputed_model_summary[, 1], imputed_model_summary[, 6], imputed_model_summary[, 7]))
imputed_model_summary <- cbind(imputed_model_OR, imputed_model_summary)
imputed_model_summary
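(As an aside: I believe newer releases of mice (3.x) can return the pooled odds ratios and confidence intervals more directly. A minimal sketch, assuming your version of summary.mipo supports the conf.int and exponentiate arguments:)

summary(pool(imputed_model), conf.int = TRUE, exponentiate = TRUE)  # pooled ORs with 95% CIs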

I seem to get similar point estimates for the effect sizes but different 95% CIs (tighter 95% CIs in the long-dataset model). I wondered if this is because the long-dataset model only accounts for the within-imputation variance (i.e. the uncertainty in each fitted model) but not the between-imputation variability, because by stacking the data into a single long dataset you have removed that.
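One way I thought I could check this (a sketch, assuming the mice 3.x structure where pool() stores its variance components in a pooled data frame with ubar, b and t columns):

pooled_fit <- pool(imputed_model)
# 'ubar' = average within-imputation variance, 'b' = between-imputation variance,
# 't'    = total variance used for the pooled standard errors
pooled_fit$pooled[, c("term", "estimate", "ubar", "b", "t")]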

My feeling from reading the literature is that approach 2 (run the model on each imputed dataset, then pool) is the correct one, but I would be grateful for feedback!

Please let me know if clarifications needed.

Best Answer

Short answer: your gut feeling is right.

Longer answer: The strength of multiple imputation lies in the pooling procedure. If you read the mice manual, the authors go into depth about this. They state that imputation is not a technique you apply to a dataset with missing data just to fill in the empty cells; rather, it is a combination of setting up a strategy to replace missing data (chained equations in the case of mice), performing the analysis on each completed dataset, and subsequently pooling the results to answer your research question (i.e. the reason you performed the analysis). As such, all of these steps are mandatory.

Now, more specifically to your situation. In the original data, some values might be selectively missing, which could lead to bias. Moreover, most analyses require complete data on all variables, so you would otherwise need to exclude cases or handle them in some way. Using imputation, you complete the missing data with 'guesstimates' under the assumption that your data are 'missing at random, conditional on observed data' (MAR). However, because these are guesses based on your data, you add some randomness and repeat the completion process multiple times in order to create a distribution of guesses.

If you analyse these data in the 'long' format you mention, you have basically inflated your sample size by a factor equal to the number of imputed datasets! This will undoubtedly increase the apparent precision of your estimates, but it is wrong: cases that were complete from the start have simply been copied m times and, more importantly, you have not taken into account the uncertainty of your guesstimates.
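A quick way to see this, reusing the objects from your question (a sketch):

nrow(heartattack)                # original sample size n
nrow(complete(imputed, "long"))  # the stacked data have m * n rows (here 20 * n)
# the naive glm on the stacked data treats these m copies as independent observations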

The better way, therefore, is to analyse the data per imputed dataset. This way you get m sets of results. However, you do not know which imputed dataset is the 'most correct' (if there even is such a thing), so the average coefficient across all models is your best estimate of the 'true' effect. For the precision (and hypothesis testing/confidence intervals) you then need to handle the standard error appropriately; this is where the uncertainty comes into play. Using Rubin's rules you average the squared standard errors across imputations (the within-imputation variance) and add 'a little extra' (the between-imputation variance) to represent the variation of the estimates across imputed datasets.
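To make the 'little extra' concrete, here is a minimal sketch of Rubin's rules applied by hand to one coefficient from your imputed_model object, assuming smokes enters the model as a single numeric predictor so its coefficient is named "smokes" (pool() does all of this for you):

fits <- imputed_model$analyses                      # the m per-imputation glm fits
est  <- sapply(fits, function(f) coef(f)["smokes"])
se   <- sapply(fits, function(f) sqrt(vcov(f)["smokes", "smokes"]))
m    <- length(fits)
W    <- mean(se^2)                                  # within-imputation variance
B    <- var(est)                                    # between-imputation variance
pooled_est <- mean(est)                             # pooled coefficient
pooled_se  <- sqrt(W + (1 + 1/m) * B)               # total SE: larger than sqrt(W) alone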

Conclusion

Finally, creating your confidence intervals and performing your hypothesis tests using these pooling rules usually reduces bias and biased inference compared to a complete-case analysis. Compared to your long-format dataset, the coefficients might be pretty similar, but as you and your gut feeling rightly pointed out, the long-format results are far more precise (too-narrow confidence intervals, too-small p-values) than can actually be justified by these imputation analyses.