This question concerns using a test data set to validate an imputed Cox model in R. With a non-imputed data set I would use `val.surv()` from `rms`, but I'm not sure how, or whether, I can use it with my multiply imputed data set.
Further explanation:
I created a predictive Cox PH model for 5-year RFS, and also used the `mice` package to multiply impute some missing data in the training data set. I then used `fit.mult.impute()` in Dr. Frank Harrell's excellent `Hmisc` package to obtain the pooled model. I have a data set that I would like to test the model on, but I am not sure how best to validate the pooled model.
Multiple imputation is a common procedure, so there must be a way that R users validate models fitted to imputed data. What functions or approaches are available for testing this pooled model against my validation data set? Here is some sample code to work with:
```r
library(rms)
library(survival)
library(mice)

remove(veteran)
data(veteran)
veteran$trt = factor(veteran$trt, levels = c(1, 2))
veteran$prior = factor(veteran$prior, levels = c(0, 10))

# Set random data to NA (columns 1, 2 and 7 are trt, celltype and age)
veteran[sample(137, 4), 1] = NA
veteran[sample(137, 4), 2] = NA
veteran[sample(137, 4), 7] = NA

impvet = mice(veteran)
survmod = with(veteran, Surv(time, status))

# Make a CPH for each imputation
for (i in seq(5)) {
  assign(paste("mod_", i, sep = ""),
         cph(survmod ~ celltype + karno,
             data = complete(impvet, i), x = TRUE, y = TRUE))
}
# Now there is a CPH model for mod_1, mod_2, mod_3, mod_4, and mod_5.

pooled_mod = fit.mult.impute(survmod ~ celltype + karno, cph, impvet,
                             data = veteran, surv = TRUE)

# Here is a test data set.
# Drop the modified copy; `veteran` then resolves to the original
# lazy-loaded data from survival, which has no NAs set above.
remove(veteran)
test_dat = data.frame(trt = replicate(500, NA), celltype = replicate(500, NA),
                      time = replicate(500, NA), status = replicate(500, NA),
                      karno = replicate(500, NA), diagtime = replicate(500, NA),
                      age = replicate(500, NA), prior = replicate(500, NA))
for (i in seq(8)) {
  test_dat[, i] = sample(veteran[, i], 500, replace = TRUE)
}
# Now there is a pooled model, "pooled_mod", and a test data set, "test_dat".
```
I'm looking forward to hearing about the R methods that can help in this situation.
Best Answer
I would suggest that you repeat every step that you would otherwise do on the non-imputed data for the first imputed data set. For example:
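The code that originally followed is not available here, so what follows is only a minimal sketch of the idea, not the answerer's own example. It assumes the same `veteran` data and model formula as the question; the names `train`, `imp`, `d1`, `fit1`, `v1` and the seed are illustrative. It fits `cph()` on the first completed data set and then passes the test data to `val.surv()` exactly as one would for a non-imputed fit.

```r
library(rms)       # cph(), val.surv()
library(survival)  # Surv(), veteran data
library(mice)      # mice(), complete()

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible

# Training data with a few values set to missing, as in the question
train <- survival::veteran
train$karno[sample(nrow(train), 4)] <- NA
train$age[sample(nrow(train), 4)] <- NA

# Multiple imputation (5 completed data sets)
imp <- mice(train, m = 5, printFlag = FALSE)

# Step 1: extract the first completed data set and fit the Cox model on it
d1 <- complete(imp, 1)
fit1 <- cph(Surv(time, status) ~ celltype + karno, data = d1,
            x = TRUE, y = TRUE, surv = TRUE)

# Step 2: validate on separate test data, as for a non-imputed fit.
# Here the test set is simply resampled rows of the original data.
orig <- survival::veteran
test_dat <- orig[sample(nrow(orig), 500, replace = TRUE), ]
v1 <- val.surv(fit1, newdata = test_dat,
               S = Surv(test_dat$time, test_dat$status))
plot(v1)
```

Changing the index in `complete(imp, 1)` repeats the same check on another completed data set, which gives a rough sense of how much the validation result depends on the particular imputation.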
Of course, you can also study the second, third, … dataset.