Solved – How to get pooled p-values on tests done in multiple imputed datasets

multiple-imputationp-valuepoolingrspss

Using Amelia in R, I obtained multiple imputed datasets. After that, I performed a repeated measures test in SPSS. Now, I want to pool test results. I know that I can use Rubin's rules (implemented through any multiple imputation package in R) to pool means and standard errors, but how do I pool p-values? Is it possible? Is there a function in R to do so?
Thanks in advance.

Best Answer

Yes, it is possible and, yes, there are R functions that do it. Instead of computing the p-values of the repeated analyses by hand, you can use the package Zelig, which is also referred to in the vignette of the Amelia-package (for a more informative method see my update below). I'll use an example from the Amelia-vignette to demonstrate this:

library("Amelia")
data(freetrade)
amelia.out <- amelia(freetrade, m = 15, ts = "year", cs = "country")

library("Zelig")
zelig.fit <- zelig(tariff ~ pop + gdp.pc + year + polity, data = amelia.out$imputations, model = "ls", cite = FALSE)
summary(zelig.fit)

This is the corresponding output including $p$-values:

  Model: ls
  Number of multiply imputed data sets: 15 

Combined results:

Call:
lm(formula = formula, weights = weights, model = F, data = data)

Coefficients:
                Value Std. Error t-stat  p-value
(Intercept)  3.18e+03   7.22e+02   4.41 6.20e-05
pop          3.13e-08   5.59e-09   5.59 4.21e-08
gdp.pc      -2.11e-03   5.53e-04  -3.81 1.64e-04
year        -1.58e+00   3.63e-01  -4.37 7.11e-05
polity       5.52e-01   3.16e-01   1.75 8.41e-02

For combined results from datasets i to j, use summary(x, subset = i:j).
For separate results, use print(summary(x), subset = i:j).

zelig can fit a host of models other than least squares.

To get confidence intervals and degrees of freedom for your estimates you can use mitools:

library("mitools")
imp.data <- imputationList(amelia.out$imputations)
mitools.fit <- MIcombine(with(imp.data, lm(tariff ~ polity + pop + gdp.pc + year)))
mitools.res <- summary(mitools.fit)
mitools.res <- cbind(mitools.res, df = mitools.fit$df)
mitools.res

This will give you confidence intervals and proportion of the total variance that is attributable to the missing data:

              results       se    (lower    upper) missInfo    df
(Intercept)  3.18e+03 7.22e+02  1.73e+03  4.63e+03     57 %  45.9
pop          3.13e-08 5.59e-09  2.03e-08  4.23e-08     19 % 392.1
gdp.pc      -2.11e-03 5.53e-04 -3.20e-03 -1.02e-03     21 % 329.4
year        -1.58e+00 3.63e-01 -2.31e+00 -8.54e-01     57 %  45.9
polity       5.52e-01 3.16e-01 -7.58e-02  1.18e+00     41 %  90.8

Of course you can just combine the interesting results into one object:

combined.results <- merge(mitools.res, zelig.res$coefficients[, c("t-stat", "p-value")], by = "row.names", all.x = TRUE)

Update

After some playing around, I have found a more flexible way to get all necessary information using the mice-package. For this to work, you'll need to modify the package's as.mids()-function. Use Gerko's version posted in my follow-up question:

as.mids2 <- function(data2, .imp=1, .id=2){
  ini <- mice(data2[data2[, .imp] == 0, -c(.imp, .id)], m = max(as.numeric(data2[, .imp])), maxit=0)
  names  <- names(ini$imp)
  if (!is.null(.id)){
    rownames(ini$data) <- data2[data2[, .imp] == 0, .id]
  }
  for (i in 1:length(names)){
    for(m in 1:(max(as.numeric(data2[, .imp])))){
      if(!is.null(ini$imp[[i]])){
        indic <- data2[, .imp] == m & is.na(data2[data2[, .imp]==0, names[i]])
        ini$imp[[names[i]]][m] <- data2[indic, names[i]]
      }
    } 
  }
  return(ini)
}

With this defined, you can go on to analyze the imputed data sets:

library("mice")
imp.data <- do.call("rbind", amelia.out$imputations)
imp.data <- rbind(freetrade, imp.data)
imp.data$.imp <- as.numeric(rep(c(0:15), each = nrow(freetrade)))
mice.data <- as.mids2(imp.data, .imp = ncol(imp.data), .id = NULL)

mice.fit <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc + year))
mice.res <- summary(pool(mice.fit, method = "rubin1987"))

This will give you all results you get using Zelig and mitools and more:

                  est       se     t    df Pr(>|t|)     lo 95     hi 95 nmis   fmi lambda
(Intercept)  3.18e+03 7.22e+02  4.41  45.9 6.20e-05  1.73e+03  4.63e+03   NA 0.571  0.552
pop          3.13e-08 5.59e-09  5.59 392.1 4.21e-08  2.03e-08  4.23e-08    0 0.193  0.189
gdp.pc      -2.11e-03 5.53e-04 -3.81 329.4 1.64e-04 -3.20e-03 -1.02e-03    0 0.211  0.206
year        -1.58e+00 3.63e-01 -4.37  45.9 7.11e-05 -2.31e+00 -8.54e-01    0 0.570  0.552
polity       5.52e-01 3.16e-01  1.75  90.8 8.41e-02 -7.58e-02  1.18e+00    2 0.406  0.393

Note, using pool() you can also calculate $p$-values with $df$ adjusted for small samples by omitting the method-parameter. What is even better, you can now also calculate $R^2$ and compare nested models:

pool.r.squared(mice.fit)

mice.fit2 <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc))
pool.compare(mice.fit, mice.fit2, method = "Wald")$pvalue

Related Solutions

Solved – Pooling imputed, still not analysed datasets in MICE

A major point of multiple imputations is to do separate analyses on each of the imputed data sets, so that you can get both pooled estimates of things like regression coefficients and an estimate of the errors in the coefficients. Averaging the imputed data sets first is not the correct use of this approach. And don't limit yourself to so few imputations; with modern computers there's no reason not to do 100 or more. See http://www.stefvanbuuren.nl/mi/MI.html, from the person who developed the mice package, for further information.

Solved – Applying Rubin’s rule for combining multiply imputed datasets

Rubin's rules can only be applied to parameters following a normal distribution. For parameters with a F or Chi Square distribution a different set of formulas is needed:

Allison, P. D. (2002). Missing data. Newbury Park, CA: Sage.

For performing an ANOVA on multiple imputed datasets you could use the R package miceadds (pdf; miceadds::mi.anova).

Update 1

Here is a complete example:

Export your data from SPSS to R. In Spss save your dataset as .csv
Read in your dataset:
```
library(miceadds)   
dat <– read.csv(file='your-dataset.csv')
```
Lets assume, that $reading$ is your dependent variable and that you have two factors
- gender, with male = 0 and female = 1
- treatment, with control = 0 and 'received treatment' = 1
Now lets convert them to factors:
```
dat$gender    <- factor(dat$gender)
dat$treatment <- factor(dat$treatment)
```
Convert your dataset to a mids object, wehere we assume, that the first variable holds the imputation number (Imputation_ in SPSS):
```
dat.mids <- as.mids(dat)
```

Now you can perform an ANOVA:

fit <- mi.anova(mi.res=dat.mids, formula="reading~gender*treatment", type=3)
summary(fit)

Update 2 This is a reply to your second comment:

What you describe here is a data import/export related problem between SPSS and R. You could try to import the .sav file directly into R and there are a bunch of dedicated packages for that: foreign, rio, gdata, Hmisc, etc. I prefer the csv-way, but that's a matter of taste and/or depends on the nature of your problem. Maybe you should also check some tutorials on youtube or other sources on the internet.

library(foreign)
dat <- read.spss(file='path-to-sav', use.value.labels=F, to.data.frame=T)

Update 3 This is a reply to your first comment:

Yes, you can do your analysis in SPSS and pool the F values in miceadds (please note this example is taken from the miceadds::micombine.F help page):

library(miceadds)
Fvalues <- c(6.76 , 4.54 , 4.23 , 5.45 , 4.78, 6.76 , 4.54 , 4.23 , 5.45 , 4.78, 
             6.76 , 4.54 , 4.23 , 5.45 , 4.78, 6.76 , 4.54 , 4.23 , 5.45 , 4.78 )
micombine(Fvalues, df1=4)

Best Answer

Update

Related Solutions

Solved – Pooling imputed, still not analysed datasets in MICE

Solved – Applying Rubin’s rule for combining multiply imputed datasets

Related Question