Solved – Multiple Imputation how to get one dataset out m=50

multiple-imputationr

So I am new to R and new to MI as well. Reading through "Flexible Imputation of Missing Data" and slowly becoming acquainted.

I was going through a sample run of my data, worked through most of the necessary steps outlined in book and "mice: Multivariate Imputation by Chained
Equations in R" paper…but I can't come up with a code that lets you take all the 50 datasets and just get one output data with imputed/oroginal values. Is this even possible with MICE or MI in general? I tried Amelia and Zelig too and I get confused at this step. I get the point that one should pool through all the datasets but going through all 50 in this sample run would be too exhausting.

Since I am still new, I apologize for missing something and would appreciate an example or code that lets me pool one dataset to model with through an index in another software.

Best Answer

It seems that you want to stack the imputed datasets. As noted by those who have commented previously, this is not the best way to analyse the data (point estimates tend to be accurate, but the variability accounted for by the imputation process is no longer present and error will be reduced). Nevertheless, stacking the data is achieved by using the complete function in the mice package. Once stacked, the data can be exported easily to other software programs.

# Impute the data using the default options
imp <- mice(df)

# Check convergence
plot(imp)

# Stack imputed data into one LONG dataset (generates two new variables indicating id and imputation number); raw (unimputed) data are appended (inc = TRUE)
com <- complete(imp, "long", inc = TRUE)

# Obtain first imputed dataset
com <- complete(imp)
com <- complete(imp, 1)

# Obtain second imputed dataset
com <- complete(imp, 2)

It is also possible to export the mids object (imp) directly to SPSS (if that is your other software) using the mids2spss function in mice.

Related Solutions

Solved – How does the mice imputation function work

By default, mice will use all the variables in your dataset to predict any other one.

As for averaging, you need to do this after calculating your stats, not before. For instance, if you want to do a linear regression, you'd do something like this:

library(mice)
mi <- mice(dataset)
mi.reg <- with(data=mi,exp=glm(y~x+z))
mi.reg.pool <- pool(mi.reg)
summary(mi.reg.pool)

The summary function will show you the averaged coefficients.

Solved – How to get pooled p-values on tests done in multiple imputed datasets

Yes, it is possible and, yes, there are R functions that do it. Instead of computing the p-values of the repeated analyses by hand, you can use the package Zelig, which is also referred to in the vignette of the Amelia-package (for a more informative method see my update below). I'll use an example from the Amelia-vignette to demonstrate this:

library("Amelia")
data(freetrade)
amelia.out <- amelia(freetrade, m = 15, ts = "year", cs = "country")

library("Zelig")
zelig.fit <- zelig(tariff ~ pop + gdp.pc + year + polity, data = amelia.out$imputations, model = "ls", cite = FALSE)
summary(zelig.fit)

This is the corresponding output including $p$-values:

  Model: ls
  Number of multiply imputed data sets: 15 

Combined results:

Call:
lm(formula = formula, weights = weights, model = F, data = data)

Coefficients:
                Value Std. Error t-stat  p-value
(Intercept)  3.18e+03   7.22e+02   4.41 6.20e-05
pop          3.13e-08   5.59e-09   5.59 4.21e-08
gdp.pc      -2.11e-03   5.53e-04  -3.81 1.64e-04
year        -1.58e+00   3.63e-01  -4.37 7.11e-05
polity       5.52e-01   3.16e-01   1.75 8.41e-02

For combined results from datasets i to j, use summary(x, subset = i:j).
For separate results, use print(summary(x), subset = i:j).

zelig can fit a host of models other than least squares.

To get confidence intervals and degrees of freedom for your estimates you can use mitools:

library("mitools")
imp.data <- imputationList(amelia.out$imputations)
mitools.fit <- MIcombine(with(imp.data, lm(tariff ~ polity + pop + gdp.pc + year)))
mitools.res <- summary(mitools.fit)
mitools.res <- cbind(mitools.res, df = mitools.fit$df)
mitools.res

This will give you confidence intervals and proportion of the total variance that is attributable to the missing data:

              results       se    (lower    upper) missInfo    df
(Intercept)  3.18e+03 7.22e+02  1.73e+03  4.63e+03     57 %  45.9
pop          3.13e-08 5.59e-09  2.03e-08  4.23e-08     19 % 392.1
gdp.pc      -2.11e-03 5.53e-04 -3.20e-03 -1.02e-03     21 % 329.4
year        -1.58e+00 3.63e-01 -2.31e+00 -8.54e-01     57 %  45.9
polity       5.52e-01 3.16e-01 -7.58e-02  1.18e+00     41 %  90.8

Of course you can just combine the interesting results into one object:

combined.results <- merge(mitools.res, zelig.res$coefficients[, c("t-stat", "p-value")], by = "row.names", all.x = TRUE)

Update

After some playing around, I have found a more flexible way to get all necessary information using the mice-package. For this to work, you'll need to modify the package's as.mids()-function. Use Gerko's version posted in my follow-up question:

as.mids2 <- function(data2, .imp=1, .id=2){
  ini <- mice(data2[data2[, .imp] == 0, -c(.imp, .id)], m = max(as.numeric(data2[, .imp])), maxit=0)
  names  <- names(ini$imp)
  if (!is.null(.id)){
    rownames(ini$data) <- data2[data2[, .imp] == 0, .id]
  }
  for (i in 1:length(names)){
    for(m in 1:(max(as.numeric(data2[, .imp])))){
      if(!is.null(ini$imp[[i]])){
        indic <- data2[, .imp] == m & is.na(data2[data2[, .imp]==0, names[i]])
        ini$imp[[names[i]]][m] <- data2[indic, names[i]]
      }
    } 
  }
  return(ini)
}

With this defined, you can go on to analyze the imputed data sets:

library("mice")
imp.data <- do.call("rbind", amelia.out$imputations)
imp.data <- rbind(freetrade, imp.data)
imp.data$.imp <- as.numeric(rep(c(0:15), each = nrow(freetrade)))
mice.data <- as.mids2(imp.data, .imp = ncol(imp.data), .id = NULL)

mice.fit <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc + year))
mice.res <- summary(pool(mice.fit, method = "rubin1987"))

This will give you all results you get using Zelig and mitools and more:

                  est       se     t    df Pr(>|t|)     lo 95     hi 95 nmis   fmi lambda
(Intercept)  3.18e+03 7.22e+02  4.41  45.9 6.20e-05  1.73e+03  4.63e+03   NA 0.571  0.552
pop          3.13e-08 5.59e-09  5.59 392.1 4.21e-08  2.03e-08  4.23e-08    0 0.193  0.189
gdp.pc      -2.11e-03 5.53e-04 -3.81 329.4 1.64e-04 -3.20e-03 -1.02e-03    0 0.211  0.206
year        -1.58e+00 3.63e-01 -4.37  45.9 7.11e-05 -2.31e+00 -8.54e-01    0 0.570  0.552
polity       5.52e-01 3.16e-01  1.75  90.8 8.41e-02 -7.58e-02  1.18e+00    2 0.406  0.393

Note, using pool() you can also calculate $p$-values with $df$ adjusted for small samples by omitting the method-parameter. What is even better, you can now also calculate $R^2$ and compare nested models:

pool.r.squared(mice.fit)

mice.fit2 <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc))
pool.compare(mice.fit, mice.fit2, method = "Wald")$pvalue

Best Answer

Related Solutions

Solved – How does the mice imputation function work

Solved – How to get pooled p-values on tests done in multiple imputed datasets

Update

Related Question