Solved – Pooling imputed, still not analysed datasets in MICE

micemultiple-imputationpoolingrregression

I need to do a Multiple Imputation on a dataset with several missing values, and I need to do it with mice, because later I'll have to compare the results with those of imputations ran with other programs.

My colleague obtained a completa dataset by running a MI on the incomplete dataset, with 5 iterations, and then taking the 5 imputed datasets and manually calculating mean values. Se essentially he pooled manually. I'm far from an expert so I don't know if this operation is valid.

Anyways, in MICE so far I could run the imputation (again, with maxit=5), using the function "imp<-mice(eco)", where "eco" is the incomplete dataset. So I obtained the 5 imputed datasets, stored in the object "imp", of class "MIDS". Now I just need to pool the 5 completed datasets to obtain a unique complete one, i don't wanna run analyses on the 5 datasets and then pool the results. Can that be done? If I got it right from the manual, it seems that MICE allows you to pool only after you ran some analysis on the imputed datasets. The analysis is repeated on each dataset and the results are stored in an object of class "MIRA". I tried to run the function "pool()" on "imp" but it can't be done because imp is of class "mids" and you can only pool mira objects.

The manual also says that to pool you need a variance/covariance relation, and in the example given there they run a linear regression and then they pool the results of the regression. But I doubt if it's what I need. I'm confused

Best Answer

A major point of multiple imputations is to do separate analyses on each of the imputed data sets, so that you can get both pooled estimates of things like regression coefficients and an estimate of the errors in the coefficients. Averaging the imputed data sets first is not the correct use of this approach. And don't limit yourself to so few imputations; with modern computers there's no reason not to do 100 or more. See http://www.stefvanbuuren.nl/mi/MI.html, from the person who developed the mice package, for further information.

Related Solutions

Solved – How to get pooled p-values on tests done in multiple imputed datasets

Yes, it is possible and, yes, there are R functions that do it. Instead of computing the p-values of the repeated analyses by hand, you can use the package Zelig, which is also referred to in the vignette of the Amelia-package (for a more informative method see my update below). I'll use an example from the Amelia-vignette to demonstrate this:

library("Amelia")
data(freetrade)
amelia.out <- amelia(freetrade, m = 15, ts = "year", cs = "country")

library("Zelig")
zelig.fit <- zelig(tariff ~ pop + gdp.pc + year + polity, data = amelia.out$imputations, model = "ls", cite = FALSE)
summary(zelig.fit)

This is the corresponding output including $p$-values:

  Model: ls
  Number of multiply imputed data sets: 15 

Combined results:

Call:
lm(formula = formula, weights = weights, model = F, data = data)

Coefficients:
                Value Std. Error t-stat  p-value
(Intercept)  3.18e+03   7.22e+02   4.41 6.20e-05
pop          3.13e-08   5.59e-09   5.59 4.21e-08
gdp.pc      -2.11e-03   5.53e-04  -3.81 1.64e-04
year        -1.58e+00   3.63e-01  -4.37 7.11e-05
polity       5.52e-01   3.16e-01   1.75 8.41e-02

For combined results from datasets i to j, use summary(x, subset = i:j).
For separate results, use print(summary(x), subset = i:j).

zelig can fit a host of models other than least squares.

To get confidence intervals and degrees of freedom for your estimates you can use mitools:

library("mitools")
imp.data <- imputationList(amelia.out$imputations)
mitools.fit <- MIcombine(with(imp.data, lm(tariff ~ polity + pop + gdp.pc + year)))
mitools.res <- summary(mitools.fit)
mitools.res <- cbind(mitools.res, df = mitools.fit$df)
mitools.res

This will give you confidence intervals and proportion of the total variance that is attributable to the missing data:

              results       se    (lower    upper) missInfo    df
(Intercept)  3.18e+03 7.22e+02  1.73e+03  4.63e+03     57 %  45.9
pop          3.13e-08 5.59e-09  2.03e-08  4.23e-08     19 % 392.1
gdp.pc      -2.11e-03 5.53e-04 -3.20e-03 -1.02e-03     21 % 329.4
year        -1.58e+00 3.63e-01 -2.31e+00 -8.54e-01     57 %  45.9
polity       5.52e-01 3.16e-01 -7.58e-02  1.18e+00     41 %  90.8

Of course you can just combine the interesting results into one object:

combined.results <- merge(mitools.res, zelig.res$coefficients[, c("t-stat", "p-value")], by = "row.names", all.x = TRUE)

Update

After some playing around, I have found a more flexible way to get all necessary information using the mice-package. For this to work, you'll need to modify the package's as.mids()-function. Use Gerko's version posted in my follow-up question:

as.mids2 <- function(data2, .imp=1, .id=2){
  ini <- mice(data2[data2[, .imp] == 0, -c(.imp, .id)], m = max(as.numeric(data2[, .imp])), maxit=0)
  names  <- names(ini$imp)
  if (!is.null(.id)){
    rownames(ini$data) <- data2[data2[, .imp] == 0, .id]
  }
  for (i in 1:length(names)){
    for(m in 1:(max(as.numeric(data2[, .imp])))){
      if(!is.null(ini$imp[[i]])){
        indic <- data2[, .imp] == m & is.na(data2[data2[, .imp]==0, names[i]])
        ini$imp[[names[i]]][m] <- data2[indic, names[i]]
      }
    } 
  }
  return(ini)
}

With this defined, you can go on to analyze the imputed data sets:

library("mice")
imp.data <- do.call("rbind", amelia.out$imputations)
imp.data <- rbind(freetrade, imp.data)
imp.data$.imp <- as.numeric(rep(c(0:15), each = nrow(freetrade)))
mice.data <- as.mids2(imp.data, .imp = ncol(imp.data), .id = NULL)

mice.fit <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc + year))
mice.res <- summary(pool(mice.fit, method = "rubin1987"))

This will give you all results you get using Zelig and mitools and more:

                  est       se     t    df Pr(>|t|)     lo 95     hi 95 nmis   fmi lambda
(Intercept)  3.18e+03 7.22e+02  4.41  45.9 6.20e-05  1.73e+03  4.63e+03   NA 0.571  0.552
pop          3.13e-08 5.59e-09  5.59 392.1 4.21e-08  2.03e-08  4.23e-08    0 0.193  0.189
gdp.pc      -2.11e-03 5.53e-04 -3.81 329.4 1.64e-04 -3.20e-03 -1.02e-03    0 0.211  0.206
year        -1.58e+00 3.63e-01 -4.37  45.9 7.11e-05 -2.31e+00 -8.54e-01    0 0.570  0.552
polity       5.52e-01 3.16e-01  1.75  90.8 8.41e-02 -7.58e-02  1.18e+00    2 0.406  0.393

Note, using pool() you can also calculate $p$-values with $df$ adjusted for small samples by omitting the method-parameter. What is even better, you can now also calculate $R^2$ and compare nested models:

pool.r.squared(mice.fit)

mice.fit2 <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc))
pool.compare(mice.fit, mice.fit2, method = "Wald")$pvalue

Solved – Multiple Imputation Pooling mean and standard deviation

You can use the pool function that comes along with mice. In ist helpsite you will find the following example:

imp <- mice(nhanes, m=5, print = FALSE, seed = 55152)
fit <- with(data=imp,exp=lm(bmi~hyp+chl))
summary(pool(fit))

what gives

                    est         se          t       df     Pr(>|t|)       lo 95       hi 95 nmis       fmi     lambda
(Intercept) 19.63676903 4.33084987  4.5341606 15.73596 0.0003524824 10.44324131 28.83029675   NA 0.2394702 0.14858446
hyp         -0.43069297 2.07375135 -0.2076879 18.41666 0.8377520514 -4.78042837  3.91904242    8 0.1563331 0.06943169
chl          0.03803107 0.02241891  1.6963831 14.99264 0.1104714393 -0.00975576  0.08581789   10 0.2619634 0.16966640

Unfortunately, at the moment, I'm too blind to see/find the smart solution. So let's do the pooling according to Rubin's rules by hand. If we assume $x_i, ~ \ldots, x_n$ to be iid samples from $N(\mu, \sigma^2)$, then $\bar{x} \sim N(\mu, \sigma^2/n)$. We will need $\sigma^2/n$ for the within variance:

n <- nrow(nhanes)
m <- 5
Q <- array(dim = m)
U <- array(dim = m)

for (i in 1:m){
  Q[i] <- mean(complete(imp, i)$bmi)
  U[i] <- var(complete(imp, i)$bmi)/n
}
B <-  var(Q)
mean_of_means <- mean(Q)
total_variance_of_means <- mean(U) + (1 + 1/m) * B

Best Answer

Related Solutions

Solved – How to get pooled p-values on tests done in multiple imputed datasets

Update

Solved – Multiple Imputation Pooling mean and standard deviation

Related Question