Solved – Applying Rubin’s rule for combining multiply imputed datasets

missing datamultiple-imputationpoolingspss

I am hoping to pool the results of a pretty basic set of analysis performed on a multiply imputed data (e.g. multiple regression, ANOVA). Multiple imputation and the analyses have been completed in SPSS but SPSS does not provide pooled results for a few statistics including the F-value, covariance matrix, R-Squared etc.

I have made a few attempts to address this issue by venturing into R or trying macros that are available and have not successfully resolved the problem (e.g. with running into issues with pooling the stats for more than 5 imputations in Mice, for example).

At this point, I would like to try computing these by hand, applying Rubin's rule, using the output that SPSS generates. However, I am not sure how I can derive the within-imputation variance ($\bar U = \frac 1 m\sum_{j=1}^mU_j$) based on the output SPSS generates.

I would really appreciate a detailed instruction on this.

Best Answer

Rubin's rules can only be applied to parameters following a normal distribution. For parameters with a F or Chi Square distribution a different set of formulas is needed:

Allison, P. D. (2002). Missing data. Newbury Park, CA: Sage.

For performing an ANOVA on multiple imputed datasets you could use the R package miceadds (pdf; miceadds::mi.anova).

Update 1

Here is a complete example:

Export your data from SPSS to R. In Spss save your dataset as .csv
Read in your dataset:
```
library(miceadds)   
dat <– read.csv(file='your-dataset.csv')
```
Lets assume, that $reading$ is your dependent variable and that you have two factors
- gender, with male = 0 and female = 1
- treatment, with control = 0 and 'received treatment' = 1
Now lets convert them to factors:
```
dat$gender    <- factor(dat$gender)
dat$treatment <- factor(dat$treatment)
```
Convert your dataset to a mids object, wehere we assume, that the first variable holds the imputation number (Imputation_ in SPSS):
```
dat.mids <- as.mids(dat)
```

Now you can perform an ANOVA:

fit <- mi.anova(mi.res=dat.mids, formula="reading~gender*treatment", type=3)
summary(fit)

Update 2 This is a reply to your second comment:

What you describe here is a data import/export related problem between SPSS and R. You could try to import the .sav file directly into R and there are a bunch of dedicated packages for that: foreign, rio, gdata, Hmisc, etc. I prefer the csv-way, but that's a matter of taste and/or depends on the nature of your problem. Maybe you should also check some tutorials on youtube or other sources on the internet.

library(foreign)
dat <- read.spss(file='path-to-sav', use.value.labels=F, to.data.frame=T)

Update 3 This is a reply to your first comment:

Yes, you can do your analysis in SPSS and pool the F values in miceadds (please note this example is taken from the miceadds::micombine.F help page):

library(miceadds)
Fvalues <- c(6.76 , 4.54 , 4.23 , 5.45 , 4.78, 6.76 , 4.54 , 4.23 , 5.45 , 4.78, 
             6.76 , 4.54 , 4.23 , 5.45 , 4.78, 6.76 , 4.54 , 4.23 , 5.45 , 4.78 )
micombine(Fvalues, df1=4)

Related Solutions

Solved – How to get pooled p-values on tests done in multiple imputed datasets

Yes, it is possible and, yes, there are R functions that do it. Instead of computing the p-values of the repeated analyses by hand, you can use the package Zelig, which is also referred to in the vignette of the Amelia-package (for a more informative method see my update below). I'll use an example from the Amelia-vignette to demonstrate this:

library("Amelia")
data(freetrade)
amelia.out <- amelia(freetrade, m = 15, ts = "year", cs = "country")

library("Zelig")
zelig.fit <- zelig(tariff ~ pop + gdp.pc + year + polity, data = amelia.out$imputations, model = "ls", cite = FALSE)
summary(zelig.fit)

This is the corresponding output including $p$-values:

  Model: ls
  Number of multiply imputed data sets: 15 

Combined results:

Call:
lm(formula = formula, weights = weights, model = F, data = data)

Coefficients:
                Value Std. Error t-stat  p-value
(Intercept)  3.18e+03   7.22e+02   4.41 6.20e-05
pop          3.13e-08   5.59e-09   5.59 4.21e-08
gdp.pc      -2.11e-03   5.53e-04  -3.81 1.64e-04
year        -1.58e+00   3.63e-01  -4.37 7.11e-05
polity       5.52e-01   3.16e-01   1.75 8.41e-02

For combined results from datasets i to j, use summary(x, subset = i:j).
For separate results, use print(summary(x), subset = i:j).

zelig can fit a host of models other than least squares.

To get confidence intervals and degrees of freedom for your estimates you can use mitools:

library("mitools")
imp.data <- imputationList(amelia.out$imputations)
mitools.fit <- MIcombine(with(imp.data, lm(tariff ~ polity + pop + gdp.pc + year)))
mitools.res <- summary(mitools.fit)
mitools.res <- cbind(mitools.res, df = mitools.fit$df)
mitools.res

This will give you confidence intervals and proportion of the total variance that is attributable to the missing data:

              results       se    (lower    upper) missInfo    df
(Intercept)  3.18e+03 7.22e+02  1.73e+03  4.63e+03     57 %  45.9
pop          3.13e-08 5.59e-09  2.03e-08  4.23e-08     19 % 392.1
gdp.pc      -2.11e-03 5.53e-04 -3.20e-03 -1.02e-03     21 % 329.4
year        -1.58e+00 3.63e-01 -2.31e+00 -8.54e-01     57 %  45.9
polity       5.52e-01 3.16e-01 -7.58e-02  1.18e+00     41 %  90.8

Of course you can just combine the interesting results into one object:

combined.results <- merge(mitools.res, zelig.res$coefficients[, c("t-stat", "p-value")], by = "row.names", all.x = TRUE)

Update

After some playing around, I have found a more flexible way to get all necessary information using the mice-package. For this to work, you'll need to modify the package's as.mids()-function. Use Gerko's version posted in my follow-up question:

as.mids2 <- function(data2, .imp=1, .id=2){
  ini <- mice(data2[data2[, .imp] == 0, -c(.imp, .id)], m = max(as.numeric(data2[, .imp])), maxit=0)
  names  <- names(ini$imp)
  if (!is.null(.id)){
    rownames(ini$data) <- data2[data2[, .imp] == 0, .id]
  }
  for (i in 1:length(names)){
    for(m in 1:(max(as.numeric(data2[, .imp])))){
      if(!is.null(ini$imp[[i]])){
        indic <- data2[, .imp] == m & is.na(data2[data2[, .imp]==0, names[i]])
        ini$imp[[names[i]]][m] <- data2[indic, names[i]]
      }
    } 
  }
  return(ini)
}

With this defined, you can go on to analyze the imputed data sets:

library("mice")
imp.data <- do.call("rbind", amelia.out$imputations)
imp.data <- rbind(freetrade, imp.data)
imp.data$.imp <- as.numeric(rep(c(0:15), each = nrow(freetrade)))
mice.data <- as.mids2(imp.data, .imp = ncol(imp.data), .id = NULL)

mice.fit <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc + year))
mice.res <- summary(pool(mice.fit, method = "rubin1987"))

This will give you all results you get using Zelig and mitools and more:

                  est       se     t    df Pr(>|t|)     lo 95     hi 95 nmis   fmi lambda
(Intercept)  3.18e+03 7.22e+02  4.41  45.9 6.20e-05  1.73e+03  4.63e+03   NA 0.571  0.552
pop          3.13e-08 5.59e-09  5.59 392.1 4.21e-08  2.03e-08  4.23e-08    0 0.193  0.189
gdp.pc      -2.11e-03 5.53e-04 -3.81 329.4 1.64e-04 -3.20e-03 -1.02e-03    0 0.211  0.206
year        -1.58e+00 3.63e-01 -4.37  45.9 7.11e-05 -2.31e+00 -8.54e-01    0 0.570  0.552
polity       5.52e-01 3.16e-01  1.75  90.8 8.41e-02 -7.58e-02  1.18e+00    2 0.406  0.393

Note, using pool() you can also calculate $p$-values with $df$ adjusted for small samples by omitting the method-parameter. What is even better, you can now also calculate $R^2$ and compare nested models:

pool.r.squared(mice.fit)

mice.fit2 <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc))
pool.compare(mice.fit, mice.fit2, method = "Wald")$pvalue

Solved – Pooling imputed, still not analysed datasets in MICE

A major point of multiple imputations is to do separate analyses on each of the imputed data sets, so that you can get both pooled estimates of things like regression coefficients and an estimate of the errors in the coefficients. Averaging the imputed data sets first is not the correct use of this approach. And don't limit yourself to so few imputations; with modern computers there's no reason not to do 100 or more. See http://www.stefvanbuuren.nl/mi/MI.html, from the person who developed the mice package, for further information.

Best Answer

Related Solutions

Solved – How to get pooled p-values on tests done in multiple imputed datasets

Update

Solved – Pooling imputed, still not analysed datasets in MICE

Related Question