Yes, it is possible and, yes, there are R
functions that do it. Instead of computing the p-values of the repeated analyses by hand, you can use the package Zelig
, which is also referred to in the vignette of the Amelia
-package (for a more informative method see my update below). I'll use an example from the Amelia
-vignette to demonstrate this:
library("Amelia")
data(freetrade)
amelia.out <- amelia(freetrade, m = 15, ts = "year", cs = "country")
library("Zelig")
zelig.fit <- zelig(tariff ~ pop + gdp.pc + year + polity, data = amelia.out$imputations, model = "ls", cite = FALSE)
summary(zelig.fit)
This is the corresponding output including $p$-values:
Model: ls
Number of multiply imputed data sets: 15
Combined results:
Call:
lm(formula = formula, weights = weights, model = F, data = data)
Coefficients:
Value Std. Error t-stat p-value
(Intercept) 3.18e+03 7.22e+02 4.41 6.20e-05
pop 3.13e-08 5.59e-09 5.59 4.21e-08
gdp.pc -2.11e-03 5.53e-04 -3.81 1.64e-04
year -1.58e+00 3.63e-01 -4.37 7.11e-05
polity 5.52e-01 3.16e-01 1.75 8.41e-02
For combined results from datasets i to j, use summary(x, subset = i:j).
For separate results, use print(summary(x), subset = i:j).
zelig
can fit a host of models other than least squares.
To get confidence intervals and degrees of freedom for your estimates you can use mitools
:
library("mitools")
imp.data <- imputationList(amelia.out$imputations)
mitools.fit <- MIcombine(with(imp.data, lm(tariff ~ polity + pop + gdp.pc + year)))
mitools.res <- summary(mitools.fit)
mitools.res <- cbind(mitools.res, df = mitools.fit$df)
mitools.res
This will give you confidence intervals and proportion of the total variance that is attributable to the missing data:
results se (lower upper) missInfo df
(Intercept) 3.18e+03 7.22e+02 1.73e+03 4.63e+03 57 % 45.9
pop 3.13e-08 5.59e-09 2.03e-08 4.23e-08 19 % 392.1
gdp.pc -2.11e-03 5.53e-04 -3.20e-03 -1.02e-03 21 % 329.4
year -1.58e+00 3.63e-01 -2.31e+00 -8.54e-01 57 % 45.9
polity 5.52e-01 3.16e-01 -7.58e-02 1.18e+00 41 % 90.8
Of course you can just combine the interesting results into one object:
combined.results <- merge(mitools.res, zelig.res$coefficients[, c("t-stat", "p-value")], by = "row.names", all.x = TRUE)
Update
After some playing around, I have found a more flexible way to get all necessary information using the mice
-package. For this to work, you'll need to modify the package's as.mids()
-function. Use Gerko's version posted in my follow-up question:
as.mids2 <- function(data2, .imp=1, .id=2){
ini <- mice(data2[data2[, .imp] == 0, -c(.imp, .id)], m = max(as.numeric(data2[, .imp])), maxit=0)
names <- names(ini$imp)
if (!is.null(.id)){
rownames(ini$data) <- data2[data2[, .imp] == 0, .id]
}
for (i in 1:length(names)){
for(m in 1:(max(as.numeric(data2[, .imp])))){
if(!is.null(ini$imp[[i]])){
indic <- data2[, .imp] == m & is.na(data2[data2[, .imp]==0, names[i]])
ini$imp[[names[i]]][m] <- data2[indic, names[i]]
}
}
}
return(ini)
}
With this defined, you can go on to analyze the imputed data sets:
library("mice")
imp.data <- do.call("rbind", amelia.out$imputations)
imp.data <- rbind(freetrade, imp.data)
imp.data$.imp <- as.numeric(rep(c(0:15), each = nrow(freetrade)))
mice.data <- as.mids2(imp.data, .imp = ncol(imp.data), .id = NULL)
mice.fit <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc + year))
mice.res <- summary(pool(mice.fit, method = "rubin1987"))
This will give you all results you get using Zelig
and mitools
and more:
est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
(Intercept) 3.18e+03 7.22e+02 4.41 45.9 6.20e-05 1.73e+03 4.63e+03 NA 0.571 0.552
pop 3.13e-08 5.59e-09 5.59 392.1 4.21e-08 2.03e-08 4.23e-08 0 0.193 0.189
gdp.pc -2.11e-03 5.53e-04 -3.81 329.4 1.64e-04 -3.20e-03 -1.02e-03 0 0.211 0.206
year -1.58e+00 3.63e-01 -4.37 45.9 7.11e-05 -2.31e+00 -8.54e-01 0 0.570 0.552
polity 5.52e-01 3.16e-01 1.75 90.8 8.41e-02 -7.58e-02 1.18e+00 2 0.406 0.393
Note, using pool()
you can also calculate $p$-values with $df$ adjusted for small samples by omitting the method
-parameter. What is even better, you can now also calculate $R^2$ and compare nested models:
pool.r.squared(mice.fit)
mice.fit2 <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc))
pool.compare(mice.fit, mice.fit2, method = "Wald")$pvalue
Rubin's rules can only be applied to parameters following a normal distribution. For parameters with a F or Chi Square distribution a different set of formulas is needed:
- Allison, P. D. (2002). Missing data. Newbury Park, CA: Sage.
For performing an ANOVA on multiple imputed datasets you could use the R package miceadds (pdf; miceadds::mi.anova
).
Update 1
Here is a complete example:
Export your data from SPSS to R. In Spss save your dataset as .csv
Read in your dataset:
library(miceadds)
dat <– read.csv(file='your-dataset.csv')
Lets assume, that $reading$ is your dependent variable and that you have two factors
- gender, with male = 0 and female = 1
- treatment, with control = 0 and 'received treatment' = 1
Now lets convert them to factors:
dat$gender <- factor(dat$gender)
dat$treatment <- factor(dat$treatment)
Convert your dataset to a mids object, wehere we assume, that the first variable holds the imputation number (Imputation_ in SPSS):
dat.mids <- as.mids(dat)
Now you can perform an ANOVA:
fit <- mi.anova(mi.res=dat.mids, formula="reading~gender*treatment", type=3)
summary(fit)
Update 2 This is a reply to your second comment:
What you describe here is a data import/export related problem between SPSS and R. You could try to import the .sav
file directly into R and there are a bunch of dedicated packages for that: foreign
, rio
, gdata
, Hmisc
, etc. I prefer the csv-way, but that's a matter of taste and/or depends on the nature of your problem. Maybe you should also check some tutorials on youtube or other sources on the internet.
library(foreign)
dat <- read.spss(file='path-to-sav', use.value.labels=F, to.data.frame=T)
Update 3 This is a reply to your first comment:
Yes, you can do your analysis in SPSS and pool the F values in miceadds
(please note this example is taken from the miceadds::micombine.F
help page):
library(miceadds)
Fvalues <- c(6.76 , 4.54 , 4.23 , 5.45 , 4.78, 6.76 , 4.54 , 4.23 , 5.45 , 4.78,
6.76 , 4.54 , 4.23 , 5.45 , 4.78, 6.76 , 4.54 , 4.23 , 5.45 , 4.78 )
micombine(Fvalues, df1=4)
Best Answer
When you have a question about how multiple imputations are handled, compare the results on the individual imputed data sets to what was reported for the pooled imputations. I don't have your data, but the
jasa
dataset from thesurvival
package has some missing data in themscore
variable (among others) to use as an example. I got a warning related toDate
objects when I tried to use all the columns, so omit the first 3.That output on its own should start to set your mind at ease, as it states how much the results are affected by imputation.
You can ask the
mice
package to retrieve a list of all 10 imputed data sets.As you might expect, the coefficients reported for
models
are the averages of those for the individual models.The
age'
coefficient is specific to thercs()
function, so that function was used in all model fits. One caution: the documentation forfit.mult.impute()
specifies that the knot positions are kept constant after they are chosen for the first imputed data set.In contrast, the pooled variance/covariance matrix isn't the average of the 10 individual matrices.
In particular, notice the higher pooled variance for
mscore
than the average over the 10 imputations. That because additional computations applying Rubin's rules have been used.Some of the confusion might come from the following sentence from the
fit.mult.impute
help documentation (within thetranscan
documentation):You don't get a set of
models
from it. Instead you get a single pooled model. But it's not the fit for "the completed dataset for the final imputation," which is simply used as a template. The documentation then goes on to say:That's what's returned by
vcov(models)
. All thatPredict()
has to work with is that single pooled model, with the averaged coefficients and the imputation-corrected coefficient covariance matrix.