Multiple Imputation – Confirming Cubic Spline on Imputed Datasets Using Mice Package

cox-modelmicemultiple-imputationrmssplines

I am performing restricted cubic spline (Cox proportional hazard ratio) after imputing 10 datasets using mice package.

My variables as follow:

Outcome: DM

Exposure: BMI

time to events: time

Covariates: age, sex, phy, smk2, alco

I am using the following codes:

library(mice)
library(Hmisc)
library (rms)
d<-datadist(DMs)
options(datadist = "d")
d$limits["Adjust to","BMI"] <- 25

#Multiple Imputation

DMs1<- mice::mice(DM,seed = 123, print = FALSE,m=10, maxit = 5)

#Restricted cubic spline model

models<-fit.mult.impute(Surv(time, DM) ~ rcs(BMI,4)+age+sex+phy+
                      smk2+alco, fitter = cph, xtrans = DM1,data = DMs,x=TRUE, Y=TRUE)

#applying the model in the data

dataplot<-Predict(models,BMI=seq(18,40,by=2),ref.zero = TRUE, fun = exp)

#ploting data

ggplot(dataplot,aes(BMI,yhat))

My question is how to confirm that restricted cubic spline was done on 10 datasets and the estimate was the pooled based on Rubin's rule?

Thank you for your support.

Best Answer

When you have a question about how multiple imputations are handled, compare the results on the individual imputed data sets to what was reported for the pooled imputations. I don't have your data, but the jasa dataset from the survival package has some missing data in the mscore variable (among others) to use as an example. I got a warning related to Date objects when I tried to use all the columns, so omit the first 3.

library(mice)
library(rms) ## also loads Hmisc, survival, etc.
ddJ <- datadist(jasa[,4:14])
options(datadist="ddJ")
m1 <- mice(jasa[4:14],m=10,seed=123,print=FALSE,maxit=5)
models <- fit.mult.impute(Surv(futime,fustat)~rcs(age,3) + mscore,fitter=cph,xtrans=m1,data= jasa[,4:14],x=TRUE,y=TRUE)

# Variance Inflation Factors Due to Imputation:
# 
#    age   age' mscore 
#   1.01   1.00   1.52 
# 
# Rate of Missing Information:
# 
#    age   age' mscore 
#   0.01   0.00   0.34 
# 
# d.f. for t-distribution for Tests of Single Coefficients:
# 
#        age       age'     mscore 
#  175205.19 1356056.66      76.09 
# 
# The following fit components were averaged over the 10 model fits:
# 
#   linear.predictors means stats center 
#

That output on its own should start to set your mind at ease, as it states how much the results are affected by imputation.

You can ask the mice package to retrieve a list of all 10 imputed data sets.

c1 <- complete(m1,action="all")

As you might expect, the coefficients reported for models are the averages of those for the individual models.

indivCoefs <- NULL
for(i in 1:10) indivCoefs <- rbind(indivCoefs,coef(cph(Surv(futime,fustat)~rcs(age,3) + mscore,data=c1[[i]],x=TRUE,y=TRUE)))
indivCoefs
#                 age       age'     mscore
#  [1,] -0.0007819043 0.03417368 0.02641657
#  [2,] -0.0016521256 0.03251149 0.35839812
#  [3,] -0.0027189336 0.03357772 0.24265117
#  [4,] -0.0024670263 0.03256148 0.35220641
#  [5,] -0.0043502230 0.03500199 0.30231338
#  [6,] -0.0026773902 0.03471552 0.17250318
#  [7,] -0.0030487879 0.03435990 0.38502635
#  [8,]  0.0002067658 0.03332910 0.30456262
#  [9,]  0.0022994480 0.03166086 0.53495959
# [10,] -0.0010426317 0.03289308 0.37069055

apply(indivCoefs,MARGIN=2,FUN=mean)
#          age         age'       mscore 
# -0.001623281  0.033478483  0.304972793

coef(models)
#         age         age'       mscore 
# -0.001623281  0.033478483  0.304972793

The age' coefficient is specific to the rcs() function, so that function was used in all model fits. One caution: the documentation for fit.mult.impute() specifies that the knot positions are kept constant after they are chosen for the first imputed data set.

In contrast, the pooled variance/covariance matrix isn't the average of the 10 individual matrices.

indivVcov <- vcov(cph(Surv(futime,fustat)~rcs(age,3) + mscore,data=c1[[1]],x=TRUE,y=TRUE))
for(i in 2:10) indivVcov <-indivVcov+vcov(cph(Surv(futime,fustat)~rcs(age,3) + mscore,data=c1[[i]],x=TRUE,y=TRUE)))

indivVcov/10
#                  age          age'        mscore
# age     0.0005464355 -0.0004401314 -0.0002257020
# age'   -0.0004401314  0.0004999929 -0.0000294507
# mscore -0.0002257020 -0.0000294507  0.0391095587

vcov(models)
#                  age          age'        mscore
# age     0.0005503802 -0.0004417423 -0.0001322173
# age'   -0.0004417423  0.0005012843 -0.0001331701
# mscore -0.0001322173 -0.0001331701  0.0596110650

In particular, notice the higher pooled variance for mscore than the average over the 10 imputations. That because additional computations applying Rubin's rules have been used.

Some of the confusion might come from the following sentence from the fit.mult.impute help documentation (within the transcan documentation):

fit.mult.impute returns a fit object that is a modification of the fit object created by fitting the completed dataset for the final imputation.

You don't get a set of models from it. Instead you get a single pooled model. But it's not the fit for "the completed dataset for the final imputation," which is simply used as a template. The documentation then goes on to say:

The var matrix in the fit object has the imputation-corrected variance-covariance matrix.

That's what's returned by vcov(models). All that Predict() has to work with is that single pooled model, with the averaged coefficients and the imputation-corrected coefficient covariance matrix.

Update

After some playing around, I have found a more flexible way to get all necessary information using the mice-package. For this to work, you'll need to modify the package's as.mids()-function. Use Gerko's version posted in my follow-up question:

as.mids2 <- function(data2, .imp=1, .id=2){
  ini <- mice(data2[data2[, .imp] == 0, -c(.imp, .id)], m = max(as.numeric(data2[, .imp])), maxit=0)
  names  <- names(ini$imp)
  if (!is.null(.id)){
    rownames(ini$data) <- data2[data2[, .imp] == 0, .id]
  }
  for (i in 1:length(names)){
    for(m in 1:(max(as.numeric(data2[, .imp])))){
      if(!is.null(ini$imp[[i]])){
        indic <- data2[, .imp] == m & is.na(data2[data2[, .imp]==0, names[i]])
        ini$imp[[names[i]]][m] <- data2[indic, names[i]]
      }
    } 
  }
  return(ini)
}

With this defined, you can go on to analyze the imputed data sets:

library("mice")
imp.data <- do.call("rbind", amelia.out$imputations)
imp.data <- rbind(freetrade, imp.data)
imp.data$.imp <- as.numeric(rep(c(0:15), each = nrow(freetrade)))
mice.data <- as.mids2(imp.data, .imp = ncol(imp.data), .id = NULL)

mice.fit <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc + year))
mice.res <- summary(pool(mice.fit, method = "rubin1987"))

This will give you all results you get using Zelig and mitools and more:

                  est       se     t    df Pr(>|t|)     lo 95     hi 95 nmis   fmi lambda
(Intercept)  3.18e+03 7.22e+02  4.41  45.9 6.20e-05  1.73e+03  4.63e+03   NA 0.571  0.552
pop          3.13e-08 5.59e-09  5.59 392.1 4.21e-08  2.03e-08  4.23e-08    0 0.193  0.189
gdp.pc      -2.11e-03 5.53e-04 -3.81 329.4 1.64e-04 -3.20e-03 -1.02e-03    0 0.211  0.206
year        -1.58e+00 3.63e-01 -4.37  45.9 7.11e-05 -2.31e+00 -8.54e-01    0 0.570  0.552
polity       5.52e-01 3.16e-01  1.75  90.8 8.41e-02 -7.58e-02  1.18e+00    2 0.406  0.393

Note, using pool() you can also calculate $p$-values with $df$ adjusted for small samples by omitting the method-parameter. What is even better, you can now also calculate $R^2$ and compare nested models:

pool.r.squared(mice.fit)

mice.fit2 <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc))
pool.compare(mice.fit, mice.fit2, method = "Wald")$pvalue

Multiple Imputation – Applying Rubin’s Rule for Combining Multiply Imputed Datasets

Rubin's rules can only be applied to parameters following a normal distribution. For parameters with a F or Chi Square distribution a different set of formulas is needed:

Allison, P. D. (2002). Missing data. Newbury Park, CA: Sage.

For performing an ANOVA on multiple imputed datasets you could use the R package miceadds (pdf; miceadds::mi.anova).

Update 1

Here is a complete example:

Export your data from SPSS to R. In Spss save your dataset as .csv
Read in your dataset:
```
library(miceadds)   
dat <– read.csv(file='your-dataset.csv')
```
Lets assume, that $reading$ is your dependent variable and that you have two factors
- gender, with male = 0 and female = 1
- treatment, with control = 0 and 'received treatment' = 1
Now lets convert them to factors:
```
dat$gender    <- factor(dat$gender)
dat$treatment <- factor(dat$treatment)
```
Convert your dataset to a mids object, wehere we assume, that the first variable holds the imputation number (Imputation_ in SPSS):
```
dat.mids <- as.mids(dat)
```

Now you can perform an ANOVA:

fit <- mi.anova(mi.res=dat.mids, formula="reading~gender*treatment", type=3)
summary(fit)

Update 2 This is a reply to your second comment:

What you describe here is a data import/export related problem between SPSS and R. You could try to import the .sav file directly into R and there are a bunch of dedicated packages for that: foreign, rio, gdata, Hmisc, etc. I prefer the csv-way, but that's a matter of taste and/or depends on the nature of your problem. Maybe you should also check some tutorials on youtube or other sources on the internet.

library(foreign)
dat <- read.spss(file='path-to-sav', use.value.labels=F, to.data.frame=T)

Update 3 This is a reply to your first comment:

Yes, you can do your analysis in SPSS and pool the F values in miceadds (please note this example is taken from the miceadds::micombine.F help page):

library(miceadds)
Fvalues <- c(6.76 , 4.54 , 4.23 , 5.45 , 4.78, 6.76 , 4.54 , 4.23 , 5.45 , 4.78, 
             6.76 , 4.54 , 4.23 , 5.45 , 4.78, 6.76 , 4.54 , 4.23 , 5.45 , 4.78 )
micombine(Fvalues, df1=4)

Best Answer

Related Solutions

Pooled P-Values – How to Get from Multiple Imputed Datasets?

Update

Multiple Imputation – Applying Rubin’s Rule for Combining Multiply Imputed Datasets

Related Question