Solved – How to validate Cox Proportional Hazards model

cox-model, cross-validation, survival, validation

I'm using a Cox proportional hazards regression (R survival package) to model credit card activation propensity, i.e., which people are more likely to make their first purchase? To give more context: Defining target variable – Credit Card industry.

So I have:

birth: Credit card creation

death: Date of first buy

event: people use their card for the first time
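
A minimal sketch of how these definitions could be turned into the `TIME` and `EVENT` columns used in the model, assuming a hypothetical data frame with `created` and `first_buy` date columns and a hypothetical observation cutoff date (these names and values are illustrative, not taken from the actual data):

```r
library(survival)

# Hypothetical toy data: card creation date and date of first purchase
# (NA means the card had not been used by the end of the observation window).
cards <- data.frame(
  created   = as.Date(c("2016-01-05", "2016-02-10", "2016-03-01")),
  first_buy = as.Date(c("2016-01-20", NA, "2016-05-15"))
)
cutoff <- as.Date("2016-04-30")  # assumed end of the observation window

# EVENT = 1 if a first purchase was observed on or before the cutoff, else 0 (censored)
cards$EVENT <- as.integer(!is.na(cards$first_buy) & cards$first_buy <= cutoff)

# TIME = days from card creation to first purchase, or to the cutoff if censored
end_date <- cards$first_buy
end_date[cards$EVENT == 0] <- cutoff
cards$TIME <- as.numeric(end_date - cards$created)

Surv(cards$TIME, cards$EVENT)
```

Cards that were never used inside the window are kept as censored observations rather than dropped, which is what lets the Cox model use them.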

Here's the model summary:

## Call:
## coxph(formula = Surv(TIME, EVENT) ~ IDADE_EMPRESA + ZERO_RATIO + 
##     AVG_VENDAS + UF_CE + UF_ES + UF_DF + VL_LIMITE_COMPRA_ORIGINAL + 
##     VL_LIMITE_PARCEIRO + SD_VENDAS, data = x)
## 
##   n= 32548, number of events= 1999 

## Concordance= 0.716  (se = 0.007 )
## Rsquare= 0.038   (max possible= 0.706 )
## Likelihood ratio test= 1252  on 9 df,   p=0
## Wald test            = 1326  on 9 df,   p=0
## Score (logrank) test = 1318  on 9 df,   p=0

What I have done so far: I used 9 months of data to fit the model and kept the remaining 3 months as a holdout validation set. Now I'm not sure how to use the validation set. What I would like to do is the following:

  • Rank the clients who are more likely to buy within 30, 60, 90 days (i.e., I don't want the survival estimate T > 30, 60, 90), then estimate the AUC or concordance index for each time period.
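
The probability of buying within a horizon t is 1 - S(t | x), so this ranking can be read off the predicted survival curves. The following is only a sketch: it uses the `lung` dataset shipped with the survival package as a stand-in for the credit card data, and the train/test split and covariates are arbitrary choices for illustration.

```r
library(survival)

# Stand-in data: the `lung` dataset from the survival package, not the credit card data.
d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.karno")])
train <- d[seq(1, nrow(d), by = 2), ]   # arbitrary split, for illustration only
test  <- d[seq(2, nrow(d), by = 2), ]

fit <- coxph(Surv(time, status) ~ age + sex + ph.karno, data = train)

# Predicted survival curves for each test subject, read off at the horizons of interest
horizons <- c(30, 60, 90)
sf <- survfit(fit, newdata = test)
surv_at <- summary(sf, times = horizons)$surv   # rows = horizons, columns = subjects

# Probability of the event happening within each horizon: 1 - S(t | x)
p_event <- t(1 - surv_at)
colnames(p_event) <- paste0("within_", horizons)

# Subjects ranked by 90-day event probability, highest first
head(p_event[order(-p_event[, "within_90"]), ])

# Rank agreement between the linear predictor and the 90-day event probability
lp <- predict(fit, newdata = test, type = "lp")
cor(lp, p_event[, "within_90"], method = "spearman")
```

Under proportional hazards the ordering of subjects is the same at every horizon (it is the ordering of the linear predictor), so only the set of subjects counted as events changes between 30, 60 and 90 days.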

Is that even possible? What are the alternatives for reporting accuracy? I have checked http://dni-institute.in/blogs/cox-regression-interpret-result-and-predict/, but it seems they are doing the opposite of what I need.

NOTE: Survival analysis is new to me, but I'm well familiar with general ML concepts like Cross Validation, Overfitting and so on. Thanks!

EDIT1: I've found the survAUC package, but I'm not sure if I understood the parameters:

  library(survival)
  library(survAUC)

  train = get.data(is.train=TRUE)
  test =  get.data(is.train=FALSE)

  fit = fit.surv() # get coxph model

  surv.train = Surv(train$TIME, train$EVENT)
  surv.test = Surv(test$TIME, test$EVENT)
  lp.train = predict(fit)                  # linear predictor on the training data
  lp.test = predict(fit, newdata = test)   # linear predictor on the holdout data
  # returns 0.7270601 0.7272526 0.7274083
  AUC.cd(surv.train, surv.test, lp.train, lp.test, c(30, 60, 90))

EDIT2: Another option, survConcordance in the survival package:

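  fit = fit.surv() # get coxph model
  test =  get.data(is.train=FALSE)
  surv.test = Surv(test$TIME, test$EVENT)

  # concordance between the holdout outcomes and the predicted linear predictor
  survConcordance(surv.test ~ predict(fit, newdata = test), data = test)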

 # Outputs
   n= 428 
   Concordance= 0.7799616 se= 0.03275571
   concordant discordant  tied.risk  tied.time   std(c-d) 
   23533.00    6639.00       0.00     144.00    1976.61 

I'm really not sure what the lines above are doing. I'd appreciate any help on this!
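
For intuition about the concordance number reported above, the sketch below counts concordant and discordant pairs by brute force on simulated data and compares the result with `concordance()` from the survival package (the function that replaces `survConcordance` in recent versions of the package). The data, the sample size and all variable names are made up for illustration.

```r
library(survival)

set.seed(1)
n <- 200
risk     <- rnorm(n)                     # simulated risk score (higher = earlier event)
t_event  <- rexp(n, rate = exp(risk))    # simulated event times
t_cens   <- rexp(n, rate = 0.5)          # simulated censoring times
obs_time <- pmin(t_event, t_cens)
event    <- as.integer(t_event <= t_cens)

# A pair (i, j) is usable when i has an observed event before j's observed time.
# It is concordant when the subject with the earlier event has the higher risk score.
conc <- 0
disc <- 0
for (i in seq_len(n)) {
  if (event[i] == 0) next
  later <- which(obs_time > obs_time[i])
  conc <- conc + sum(risk[i] > risk[later])
  disc <- disc + sum(risk[i] < risk[later])
}
c_manual <- conc / (conc + disc)

# reverse = TRUE because a higher risk score should go with a shorter time
c_pkg <- concordance(Surv(obs_time, event) ~ risk, reverse = TRUE)$concordance

c(manual = c_manual, package = c_pkg)
```

With continuous simulated data there are no ties, so the two numbers should agree. With real data, tied times and tied scores are handled separately, which is what the `tied.risk` and `tied.time` columns in the output above refer to.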

Best Answer

This is not what you do to validate an event time model. You need a smooth calibration curve at each of a series of time horizons, plus validation of predictive discrimination, e.g., Somers' Dxy rank correlation (a rescaling of the c-index: Dxy = 2(c - 0.5)). The R rms package makes this easy, and it can use the bootstrap to correct for overfitting, provided you are honest about including all candidate variables in the model. See my course notes for details: http://biostat.mc.vanderbilt.edu/rms
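
A rough sketch of what this workflow might look like with rms, again using the `lung` data from the survival package as a stand-in. The covariates, the 90-day horizon and the number of bootstrap repetitions `B` are illustrative assumptions, and this is one reading of the advice above rather than the answerer's own code.

```r
library(survival)
library(rms)

# Stand-in data, not the credit card data
d <- na.omit(survival::lung[, c("time", "status", "age", "sex", "ph.karno")])
d$event <- as.integer(d$status == 2)   # recode status to 0 = censored, 1 = event

# x = TRUE, y = TRUE keep the design matrix and response for resampling;
# surv = TRUE and time.inc are needed to calibrate at a fixed time horizon
f <- cph(Surv(time, event) ~ age + sex + ph.karno, data = d,
         x = TRUE, y = TRUE, surv = TRUE, time.inc = 90)

set.seed(1)
# Bootstrap optimism-corrected discrimination indexes, including Somers' Dxy
v <- validate(f, B = 50)   # B kept small here for speed
v["Dxy", c("index.orig", "optimism", "index.corrected")]

# Bootstrap overfitting-corrected calibration curve at the 90-day horizon
cal <- calibrate(f, u = 90, B = 50)
plot(cal)
```

To cover several horizons, the model would be refit with a different `time.inc` and `calibrate()` called with the matching `u` for each one.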
