Multiple Imputation – Confirming Cubic Spline on Imputed Datasets Using Mice Package

cox-modelmicemultiple-imputationrmssplines

I am performing restricted cubic spline (Cox proportional hazard ratio) after imputing 10 datasets using mice package.

My variables as follow:

Outcome: DM

Exposure: BMI

time to events: time

Covariates: age, sex, phy, smk2, alco

I am using the following codes:

library(mice)
library(Hmisc)
library (rms)
d<-datadist(DMs)
options(datadist = "d")
d$limits["Adjust to","BMI"] <- 25

#Multiple Imputation

DMs1<- mice::mice(DM,seed = 123, print = FALSE,m=10, maxit = 5)

#Restricted cubic spline model

models<-fit.mult.impute(Surv(time, DM) ~ rcs(BMI,4)+age+sex+phy+
                      smk2+alco, fitter = cph, xtrans = DM1,data = DMs,x=TRUE, Y=TRUE)

#applying the model in the data

dataplot<-Predict(models,BMI=seq(18,40,by=2),ref.zero = TRUE, fun = exp)

#ploting data

ggplot(dataplot,aes(BMI,yhat))

My question is how to confirm that restricted cubic spline was done on 10 datasets and the estimate was the pooled based on Rubin's rule?

Thank you for your support.

Best Answer

When you have a question about how multiple imputations are handled, compare the results on the individual imputed data sets to what was reported for the pooled imputations. I don't have your data, but the jasa dataset from the survival package has some missing data in the mscore variable (among others) to use as an example. I got a warning related to Date objects when I tried to use all the columns, so omit the first 3.

library(mice)
library(rms) ## also loads Hmisc, survival, etc.
ddJ <- datadist(jasa[,4:14])
options(datadist="ddJ")
m1 <- mice(jasa[4:14],m=10,seed=123,print=FALSE,maxit=5)
models <- fit.mult.impute(Surv(futime,fustat)~rcs(age,3) + mscore,fitter=cph,xtrans=m1,data= jasa[,4:14],x=TRUE,y=TRUE)

# Variance Inflation Factors Due to Imputation:
# 
#    age   age' mscore 
#   1.01   1.00   1.52 
# 
# Rate of Missing Information:
# 
#    age   age' mscore 
#   0.01   0.00   0.34 
# 
# d.f. for t-distribution for Tests of Single Coefficients:
# 
#        age       age'     mscore 
#  175205.19 1356056.66      76.09 
# 
# The following fit components were averaged over the 10 model fits:
# 
#   linear.predictors means stats center 
#

That output on its own should start to set your mind at ease, as it states how much the results are affected by imputation.

You can ask the mice package to retrieve a list of all 10 imputed data sets.

c1 <- complete(m1,action="all")

As you might expect, the coefficients reported for models are the averages of those for the individual models.

indivCoefs <- NULL
for(i in 1:10) indivCoefs <- rbind(indivCoefs,coef(cph(Surv(futime,fustat)~rcs(age,3) + mscore,data=c1[[i]],x=TRUE,y=TRUE)))
indivCoefs
#                 age       age'     mscore
#  [1,] -0.0007819043 0.03417368 0.02641657
#  [2,] -0.0016521256 0.03251149 0.35839812
#  [3,] -0.0027189336 0.03357772 0.24265117
#  [4,] -0.0024670263 0.03256148 0.35220641
#  [5,] -0.0043502230 0.03500199 0.30231338
#  [6,] -0.0026773902 0.03471552 0.17250318
#  [7,] -0.0030487879 0.03435990 0.38502635
#  [8,]  0.0002067658 0.03332910 0.30456262
#  [9,]  0.0022994480 0.03166086 0.53495959
# [10,] -0.0010426317 0.03289308 0.37069055

apply(indivCoefs,MARGIN=2,FUN=mean)
#          age         age'       mscore 
# -0.001623281  0.033478483  0.304972793

coef(models)
#         age         age'       mscore 
# -0.001623281  0.033478483  0.304972793 

The age' coefficient is specific to the rcs() function, so that function was used in all model fits. One caution: the documentation for fit.mult.impute() specifies that the knot positions are kept constant after they are chosen for the first imputed data set.

In contrast, the pooled variance/covariance matrix isn't the average of the 10 individual matrices.

indivVcov <- vcov(cph(Surv(futime,fustat)~rcs(age,3) + mscore,data=c1[[1]],x=TRUE,y=TRUE))
for(i in 2:10) indivVcov <-indivVcov+vcov(cph(Surv(futime,fustat)~rcs(age,3) + mscore,data=c1[[i]],x=TRUE,y=TRUE)))

indivVcov/10
#                  age          age'        mscore
# age     0.0005464355 -0.0004401314 -0.0002257020
# age'   -0.0004401314  0.0004999929 -0.0000294507
# mscore -0.0002257020 -0.0000294507  0.0391095587

vcov(models)
#                  age          age'        mscore
# age     0.0005503802 -0.0004417423 -0.0001322173
# age'   -0.0004417423  0.0005012843 -0.0001331701
# mscore -0.0001322173 -0.0001331701  0.0596110650

In particular, notice the higher pooled variance for mscore than the average over the 10 imputations. That because additional computations applying Rubin's rules have been used.

Some of the confusion might come from the following sentence from the fit.mult.impute help documentation (within the transcan documentation):

fit.mult.impute returns a fit object that is a modification of the fit object created by fitting the completed dataset for the final imputation.

You don't get a set of models from it. Instead you get a single pooled model. But it's not the fit for "the completed dataset for the final imputation," which is simply used as a template. The documentation then goes on to say:

The var matrix in the fit object has the imputation-corrected variance-covariance matrix.

That's what's returned by vcov(models). All that Predict() has to work with is that single pooled model, with the averaged coefficients and the imputation-corrected coefficient covariance matrix.