Solved – compute 95% confidence interval for predictions using a pooled model after multiple imputation

Tags: confidence-interval, mice, r

I am conducting multiple imputation by chained equations in R using the mice package, followed by a logistic regression fitted to each imputed dataset and pooled.

I need to compute a 95% confidence interval about the predictions for use in creating a plot—that is, the grey shading in the image below.

[Image: plot of model predictions with a grey 95% confidence band]


I followed the approach described in the answer to this question…

How are the standard errors computed for the fitted values from a logistic regression?

…which uses the following lines of code to yield the std.er of prediction for any specific value of the predictor:

o <- glm(y ~ x, data = dat, family = binomial)
C <- c(1, 1.5)  # design vector: intercept and predictor value x = 1.5
std.er <- sqrt(t(C) %*% vcov(o) %*% C)  # SE on the linear-predictor scale

But of course I need to adapt this code to the fact that I am using a model pooled after multiple imputation. In that context, I am not sure which variance-covariance matrix (the analogue of vcov(o) in the example above) I should be using to produce the std.er.


Based on the documentation for MICE I see three candidate matrices:

  • ubar – The average of the variance-covariance matrix of the complete data estimates.

  • b – The between imputation variance-covariance matrix.

  • t – The total variance-covariance matrix.

http://www.inside-r.org/packages/cran/mice/docs/is.mipo

Based on trying all three, the b matrix seems patently wrong, but both the t and the ubar matrices seem plausible. Can anybody confirm which one is appropriate?

Thank you.

Best Answer

The t matrix is the one to use in the way you describe. Eqs. 4 through 7 in the Dong & Peng paper that Joe_74 references correspond to the like-named elements of the mipo object (the documentation you link), so t is the correct variance-covariance matrix for the pooled regression coefficients qbar you are actually using. ubar and b matter here only insofar as they are used to compute t.
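As a concrete illustration of that relationship, here is a small base-R sketch with toy numbers (nothing below comes from mice itself) showing how Rubin's rules combine the within- and between-imputation variance-covariance matrices into the total one:

```r
# Rubin's rules with m = 3 toy imputations:
# qhat holds per-imputation coefficient vectors; u holds their vcov matrices
m <- 3
qhat <- list(c(0.9, 2.1), c(1.1, 1.9), c(1.0, 2.0))
u <- replicate(m, diag(0.01, 2), simplify = FALSE)

qbar <- Reduce(`+`, qhat) / m          # pooled coefficients
ubar <- Reduce(`+`, u) / m             # average within-imputation variance
b <- Reduce(`+`, lapply(qhat, function(q) tcrossprod(q - qbar))) / (m - 1)
t_total <- ubar + (1 + 1/m) * b        # total variance-covariance matrix
```

With an actual mipo object these quantities are exactly the ubar, b, and t elements, which is why t is the matrix that reflects all sources of uncertainty in qbar.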

Presumably you'll be using more than one predictor, so here's a MWE for that, which should be easy to modify.

library(mice)

set.seed(500)
dat <- data.frame(y  = runif(20, 0, 0.5),
                  x1 = c(runif(15), rep(NA, 5)),  # five missing values to impute
                  x2 = runif(20, 0.5, 1))
imp <- mice(dat)
impMods <- with(imp, lm(y ~ x1 + x2))
pooledMod <- pool(impMods)

# Generate some hypothetical cases we want predictions for
newCases <- data.frame(x1 = c(4, 7), x2 = c(-6, 0))
# Tack on the column of 1's for the intercept
newCases <- cbind(1, newCases)

# Generating the actual predictions is simple: sums of values times coefficients
yhats <- rowSums(sweep(newCases, 2, pooledMod$qbar, `*`))

# Take each new case and perform the standard operation
# with the t matrix to get the prediction std. error
predErr <- apply(newCases, 1, function(X) sqrt(t(X) %*% pooledMod$t %*% X))

# Finally, a plot-worthy table of predictions with upper and lower bounds
# (assuming normality here rather than using t-distribution critical values)
results <- data.frame(yhats,
                      lwr = yhats - 1.96 * predErr,
                      upr = yhats + 1.96 * predErr)
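To produce the grey shading from the question itself, here is a sketch that re-runs a setup like the MWE and assembles qbar and the t matrix by hand via Rubin's rules (so it does not depend on how a given mice version stores the pooled results), then shades yhat ± 1.96·se over a grid of x1 with x2 held at its mean:

```r
library(mice)

set.seed(500)
dat <- data.frame(y  = runif(20, 0, 0.5),
                  x1 = c(runif(15), rep(NA, 5)),
                  x2 = runif(20, 0.5, 1))
imp <- mice(dat, printFlag = FALSE)
fits <- with(imp, lm(y ~ x1 + x2))$analyses  # one fitted model per imputation
m <- length(fits)

# Pool by Rubin's rules (these match qbar and t in the mipo object)
qhat <- sapply(fits, coef)                    # coefficients, one column per imputation
qbar <- rowMeans(qhat)
ubar <- Reduce(`+`, lapply(fits, vcov)) / m   # within-imputation variance
b    <- cov(t(qhat))                          # between-imputation variance
tmat <- ubar + (1 + 1/m) * b                  # total variance-covariance matrix

# Grey 95% band: vary x1 over a grid, hold x2 at its mean
grid <- cbind(1, x1 = seq(0, 1, length.out = 50), x2 = mean(dat$x2))
yhat <- drop(grid %*% qbar)
se   <- apply(grid, 1, function(X) sqrt(t(X) %*% tmat %*% X))

plot(grid[, "x1"], yhat, type = "n", xlab = "x1", ylab = "predicted y")
polygon(c(grid[, "x1"], rev(grid[, "x1"])),
        c(yhat - 1.96 * se, rev(yhat + 1.96 * se)),
        col = "grey80", border = NA)
lines(grid[, "x1"], yhat)
```

For a logistic model, do the same on the linear-predictor scale and apply plogis() to yhat and both band edges before plotting, so the interval stays inside [0, 1].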