Solved – Best method to validate a multiply imputed Cox model with R

predictive-models, r, rms, survival, validation

This question concerns using a test data set to validate a multiply imputed Cox model in R. With a non-imputed data set I would use val.surv() from rms, but I'm not sure how, or whether, I can use it with my multiply imputed data set.

Further explanation:
I created a predictive Cox PH model for 5-year RFS, and also used the mice package to multiply impute some missing data in the training data set. I then used fit.mult.impute() in Dr. Frank Harrell's excellent Hmisc package to obtain the pooled model. I have a data set that I would like to test the model on, but I am not sure how best to validate the pooled model.

Multiple imputation is a common procedure, so there must be a way that R users validate models fitted to imputed data. What functions or approaches are available for testing this pooled model against my validation data set? Here is some sample code to work with:

library(rms)
library(survival)
library(mice)

#Start from a fresh copy of the veteran data shipped with survival
veteran=survival::veteran
veteran$trt=factor(veteran$trt,levels=c(1,2))
veteran$prior=factor(veteran$prior,levels=c(0,10))

#Set random data to NA (column 1 = trt, 2 = celltype, 7 = age)
veteran[sample(137,4),1]=NA
veteran[sample(137,4),2]=NA
veteran[sample(137,4),7]=NA

impvet=mice(veteran)
survmod=with(veteran,Surv(time,status))

#make a CPH for each imputation
for(i in seq(5)){
    assign(paste("mod_",i,sep=""),cph(survmod~celltype+karno,
        data=complete(impvet,i),x=T,y=T))
}

#Now there is a CPH model for mod_1, mod_2, mod_3, mod_4, and mod_5.

pooled_mod=fit.mult.impute(survmod~celltype+karno,cph,impvet,data=veteran,surv=T)

#Here is a test data set.
#Drop the modified copy so that "veteran" again refers to the original data in survival.
remove(veteran)
test_dat=data.frame(trt=replicate(500,NA), celltype=replicate(500,NA), time=replicate(500,NA), status=replicate(500,NA), karno=replicate(500,NA), diagtime=replicate(500,NA), age=replicate(500,NA), prior=replicate(500,NA))
for(i in seq(8)){
test_dat[,i]=sample(veteran[,i],500,replace=T)
}

#Now there is a pooled model, "pooled_mod", and a test data set, "test_dat".

I'm looking forward to hearing about the R methods that can help in this situation.

Best Answer

I would suggest that you repeat every step that you would otherwise do on the non-imputed data for the first imputed data set. For example:

# fit the model on the multiply imputed data
fit <- with(impvet, cph(survmod ~ celltype + karno, x = T, y = T))

# take out the first imputed data set
# (complete() needs the mids object returned by mice(), not the fitted analyses)
data <- complete(impvet, 1)

# take out the first fitted cph model
mod <- fit$analyses[[1]]

# use your val.surv() steps here on the first imputed data set
...

Of course, you can repeat this for the second, third, … imputed data set.
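As a rough illustration of looping over the imputations, here is a minimal sketch that fits one Cox model per imputed training set and scores each on held-out data. Note the assumptions: it uses coxph() and concordance() from survival rather than cph() and val.surv(), it measures discrimination (Harrell's C) rather than calibration, and the train/test split and the names vet, train, test, imp and c_index are made up for this example.

```r
library(survival)
library(mice)

set.seed(42)

# Hypothetical setup: hold out 37 rows of veteran as a test set.
vet <- survival::veteran
test_rows <- sample(nrow(vet), 37)
train <- vet[-test_rows, ]
test <- vet[test_rows, ]

# Introduce some missing values into the training data only.
train[sample(nrow(train), 4), "celltype"] <- NA
train[sample(nrow(train), 4), "age"] <- NA

# Multiply impute the training data (5 imputations).
imp <- mice(train, m = 5, printFlag = FALSE)

# For each imputed data set: fit a Cox model, predict the linear
# predictor on the test data, and compute the concordance there.
c_index <- sapply(seq_len(imp$m), function(i) {
  fit_i <- coxph(Surv(time, status) ~ celltype + karno,
                 data = complete(imp, i))
  scored <- test
  scored$lp <- predict(fit_i, newdata = test, type = "lp")
  # reverse = TRUE because a higher linear predictor means higher risk,
  # i.e. shorter expected survival
  concordance(Surv(time, status) ~ lp, data = scored,
              reverse = TRUE)$concordance
})

c_index        # one test-set C-index per imputation
mean(c_index)  # simple average across imputations
```

The average in the last line is only a convenient summary of how the five models perform on the test data; it is not a Rubin's rules pooled estimate, and the spread of c_index shows how much the result depends on the particular imputation.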
