Solved – Interpretation of survival risk predictions generated from R’s predictSurvProb

cox-modelprobabilityrself-studysurvival

I am starting to learn R and its power for giving survival predictions.

Abstract

I use the predictSurvProb function from the pec package to get predicted survival risks. My data has 5000 rows wherein each thousand falls into a certain time category. These time categories are separated by two-year intervals; thus for time periods one to five, these have 1000 records that fall within these ranges.

My data has 47 variables: the time and event; 20 numeric, 10 binary and 15 nominal predictive variables.

My steps to generate the risks are as follows:

  1. Fit a Cox model, xns are the rest of the predictive variables.

    coxmodel <- cph(Surv(time,event)~x1+x2+x3+x4+x5, 
        data=myData, surv=TRUE, x=TRUE, y=TRUE) 
    
  2. Declare time periods.

    myTime <- c(24,48,72,96,120) # months
    periods <- quantile(myTime, conf.int=TRUE)
    
  3. Get the prediction risks!

    survRiskPreds <- predictSurvProb(coxmodel, newdata=myData, times=periods)
    round(preds, digits=6)
    

Problem

(Note that these below are example results but nonetheless represent the phenomena that I'm facing.)

Since I have five different time periods, I expect each of the 1000 records to have optimistic results in their own time period for example:

                                  24      48      72      96     120 
Record survives for 30 months : 0.96251 0.81881 0.65141 0.46836 0.11000
Record survives for 60 months : ?       0.96251 0.81881 0.65141 0.46836
Record survives for 90 months : ?       ?       0.96251 0.81881 0.65141
Record survives for 100 months: ?       ?       ?       0.96251 0.81881 
Record survives for 150 months: ?       ?       ?       ?       0.96251

But in reality the results have the pattern for all records no matter what their time value is:

                                  24      48      72      96     120 
Record survives for 30 months : 0.96251 0.81881 0.65141 0.46836 0.11000
Record survives for 60 months : 0.98111 0.83809 0.71174 0.60173 0.21272
Record survives for 90 months : 0.95411 0.82209 0.69974 0.51936 0.20272
Record survives for 100 months: 0.99911 0.88909 0.60074 0.57736 0.21272
Record survives for 150 months: 0.98801 0.78109 0.68211 0.51436 0.33272

Sample data and code

I made a dummy dataset that more or less represents the one I currently have.

Below is the script I use to manipulate the data. I included all predictive variables in the Cox model.

library(survival)
library(pec) 
library(rms)

theData <- read.csv("dummy.csv",
 colClasses=c(
   "female"="factor",
   "bin1"="factor",
   "bin2"="factor",
   "marital"="factor",
   "yrDiagnosis"="factor",
   "response1"="factor",
   "response2"="factor",
   "response3"="factor",
   "country"="factor",
   "race"="factor"))
attach(theData)

# removed one predictive variable (yrDiagnosis) because it caused an error
coxmodel <- cph(Surv(time,event)~age+num1+num2+num3+num4+
  num5+female+bin1+bin2+marital+response1+
  response2+response3+response4+response5+response6+country+race  
  , data=myData, surv=TRUE, x=TRUE, y=TRUE) 
coxmodel 

myTime <- c(24,48,72,96,120)
periods <- quantile(myTime, conf.int=TRUE)
periods

preds <- predictSurvProb(coxmodel, newdata=theData, times=periods)
round(preds[1:5,], digits=6)       # survival within 24 months
round(preds[1000:1005,], digits=6) # survival within 48 months
round(preds[2000:2005,], digits=6) # survival within 72 months
round(preds[3000:3005,], digits=6) # survival within 96 months
round(preds[4000:4005,], digits=6) # survival within 120 months

Am I doing something wrong ie. wrong format in the predictSurvProb steps?

I am not familiar with other R packages but I would appreciate if anyone guides me get to my desired result.

Best Answer

Through the predictSurvProb-function you use the model to predict the survival probability (probability to be alive at that specific timepoint) for each individual at each timepoint you enter in the times-argument. Theoretically, if the model predictions would perfectly correspond to the observed situation, the predicted survival probability would be 0 for an observation who died before the particular timepoint for which you want the prediction. However, in reality this is never the case and the model provides you with a predicted probability between 0 and 1 which is most likely for a given combination of covariables. Thus, for individual observations, the predicted suvival probability for 1 person will never be equal to the observed probability as the latter is either 0 (death) or 1 (alive). However, depending on how well your model performs, the predicted probabilities should correspond to the observed probabilities on a group/population level. (E.g. for observations with a certain combination of predictor values, on average, the predicted survival probabilities should correspond to the observed survival (KM-estimates)).

This is also why you will have to validate your model by assessing the predictive performance. For discrimination see the functions cindex-from the pec-package or validatefrom the rms-package. To evaluate the calibration it is advisable to make a calibration plot. Futhermore, to assess the clinical benefit of a model you could consider decision curve analyses (see paper by Vickers and this overview paper). For further reading also see the books by Harrell and Steyerberg.

P.S. There is no need to take quantiles for your timepoints, just enter c(24,48,72,96,120) in the times-argument when calling predictSurvProb.

Related Question