Yes, it is possible and, yes, there are R
functions that do it. Instead of computing the p-values of the repeated analyses by hand, you can use the package Zelig
, which is also referred to in the vignette of the Amelia
-package (for a more informative method see my update below). I'll use an example from the Amelia
-vignette to demonstrate this:
library("Amelia")
data(freetrade)
amelia.out <- amelia(freetrade, m = 15, ts = "year", cs = "country")
library("Zelig")
zelig.fit <- zelig(tariff ~ pop + gdp.pc + year + polity, data = amelia.out$imputations, model = "ls", cite = FALSE)
summary(zelig.fit)
This is the corresponding output including $p$-values:
Model: ls
Number of multiply imputed data sets: 15
Combined results:
Call:
lm(formula = formula, weights = weights, model = F, data = data)
Coefficients:
Value Std. Error t-stat p-value
(Intercept) 3.18e+03 7.22e+02 4.41 6.20e-05
pop 3.13e-08 5.59e-09 5.59 4.21e-08
gdp.pc -2.11e-03 5.53e-04 -3.81 1.64e-04
year -1.58e+00 3.63e-01 -4.37 7.11e-05
polity 5.52e-01 3.16e-01 1.75 8.41e-02
For combined results from datasets i to j, use summary(x, subset = i:j).
For separate results, use print(summary(x), subset = i:j).
zelig
can fit a host of models other than least squares.
To get confidence intervals and degrees of freedom for your estimates you can use mitools
:
library("mitools")
imp.data <- imputationList(amelia.out$imputations)
mitools.fit <- MIcombine(with(imp.data, lm(tariff ~ polity + pop + gdp.pc + year)))
mitools.res <- summary(mitools.fit)
mitools.res <- cbind(mitools.res, df = mitools.fit$df)
mitools.res
This will give you confidence intervals and proportion of the total variance that is attributable to the missing data:
results se (lower upper) missInfo df
(Intercept) 3.18e+03 7.22e+02 1.73e+03 4.63e+03 57 % 45.9
pop 3.13e-08 5.59e-09 2.03e-08 4.23e-08 19 % 392.1
gdp.pc -2.11e-03 5.53e-04 -3.20e-03 -1.02e-03 21 % 329.4
year -1.58e+00 3.63e-01 -2.31e+00 -8.54e-01 57 % 45.9
polity 5.52e-01 3.16e-01 -7.58e-02 1.18e+00 41 % 90.8
Of course you can just combine the interesting results into one object:
combined.results <- merge(mitools.res, zelig.res$coefficients[, c("t-stat", "p-value")], by = "row.names", all.x = TRUE)
Update
After some playing around, I have found a more flexible way to get all necessary information using the mice
-package. For this to work, you'll need to modify the package's as.mids()
-function. Use Gerko's version posted in my follow-up question:
as.mids2 <- function(data2, .imp=1, .id=2){
ini <- mice(data2[data2[, .imp] == 0, -c(.imp, .id)], m = max(as.numeric(data2[, .imp])), maxit=0)
names <- names(ini$imp)
if (!is.null(.id)){
rownames(ini$data) <- data2[data2[, .imp] == 0, .id]
}
for (i in 1:length(names)){
for(m in 1:(max(as.numeric(data2[, .imp])))){
if(!is.null(ini$imp[[i]])){
indic <- data2[, .imp] == m & is.na(data2[data2[, .imp]==0, names[i]])
ini$imp[[names[i]]][m] <- data2[indic, names[i]]
}
}
}
return(ini)
}
With this defined, you can go on to analyze the imputed data sets:
library("mice")
imp.data <- do.call("rbind", amelia.out$imputations)
imp.data <- rbind(freetrade, imp.data)
imp.data$.imp <- as.numeric(rep(c(0:15), each = nrow(freetrade)))
mice.data <- as.mids2(imp.data, .imp = ncol(imp.data), .id = NULL)
mice.fit <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc + year))
mice.res <- summary(pool(mice.fit, method = "rubin1987"))
This will give you all results you get using Zelig
and mitools
and more:
est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
(Intercept) 3.18e+03 7.22e+02 4.41 45.9 6.20e-05 1.73e+03 4.63e+03 NA 0.571 0.552
pop 3.13e-08 5.59e-09 5.59 392.1 4.21e-08 2.03e-08 4.23e-08 0 0.193 0.189
gdp.pc -2.11e-03 5.53e-04 -3.81 329.4 1.64e-04 -3.20e-03 -1.02e-03 0 0.211 0.206
year -1.58e+00 3.63e-01 -4.37 45.9 7.11e-05 -2.31e+00 -8.54e-01 0 0.570 0.552
polity 5.52e-01 3.16e-01 1.75 90.8 8.41e-02 -7.58e-02 1.18e+00 2 0.406 0.393
Note, using pool()
you can also calculate $p$-values with $df$ adjusted for small samples by omitting the method
-parameter. What is even better, you can now also calculate $R^2$ and compare nested models:
pool.r.squared(mice.fit)
mice.fit2 <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc))
pool.compare(mice.fit, mice.fit2, method = "Wald")$pvalue
You can use the pool
function that comes along with mice
. In ist helpsite you will find the following example:
imp <- mice(nhanes, m=5, print = FALSE, seed = 55152)
fit <- with(data=imp,exp=lm(bmi~hyp+chl))
summary(pool(fit))
what gives
est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
(Intercept) 19.63676903 4.33084987 4.5341606 15.73596 0.0003524824 10.44324131 28.83029675 NA 0.2394702 0.14858446
hyp -0.43069297 2.07375135 -0.2076879 18.41666 0.8377520514 -4.78042837 3.91904242 8 0.1563331 0.06943169
chl 0.03803107 0.02241891 1.6963831 14.99264 0.1104714393 -0.00975576 0.08581789 10 0.2619634 0.16966640
Unfortunately, at the moment, I'm too blind to see/find the smart solution. So let's do the pooling according to Rubin's rules by hand. If we assume $x_i, ~ \ldots, x_n$ to be iid samples from $N(\mu, \sigma^2)$, then $\bar{x} \sim N(\mu, \sigma^2/n)$. We will need $\sigma^2/n$ for the within variance:
n <- nrow(nhanes)
m <- 5
Q <- array(dim = m)
U <- array(dim = m)
for (i in 1:m){
Q[i] <- mean(complete(imp, i)$bmi)
U[i] <- var(complete(imp, i)$bmi)/n
}
B <- var(Q)
mean_of_means <- mean(Q)
total_variance_of_means <- mean(U) + (1 + 1/m) * B
Best Answer
A major point of multiple imputations is to do separate analyses on each of the imputed data sets, so that you can get both pooled estimates of things like regression coefficients and an estimate of the errors in the coefficients. Averaging the imputed data sets first is not the correct use of this approach. And don't limit yourself to so few imputations; with modern computers there's no reason not to do 100 or more. See http://www.stefvanbuuren.nl/mi/MI.html, from the person who developed the
mice
package, for further information.