I'm using a Cox proportional hazards regression (R survival package) to model credit card activation propensity, i.e., which people are more likely to make their first purchase. For context, I'm defining the target variable in the credit card industry.
So I have:
birth: credit card creation date
death: date of first purchase
event: the person uses their card for the first time
Here's the model summary:
## Call:
## coxph(formula = Surv(TIME, EVENT) ~ IDADE_EMPRESA + ZERO_RATIO +
## AVG_VENDAS + UF_CE + UF_ES + UF_DF + VL_LIMITE_COMPRA_ORIGINAL +
## VL_LIMITE_PARCEIRO + SD_VENDAS, data = x)
##
## n= 32548, number of events= 1999
## Concordance= 0.716 (se = 0.007 )
## Rsquare= 0.038 (max possible= 0.706 )
## Likelihood ratio test= 1252 on 9 df, p=0
## Wald test = 1326 on 9 df, p=0
## Score (logrank) test = 1318 on 9 df, p=0
What I have done so far: used 9 months of data to fit the model and the remaining 3 months as a holdout validation set. Now I'm not sure how to use the validation set. What I would like to do is the following:
- Rank the clients who are most likely to buy within 30, 60, and 90 days (i.e., I don't want the survival estimate P(T > 30, 60, 90) itself, but the probability of buying by then), then estimate the AUC or concordance index for each time period.
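Concretely, the ranking step could look like the sketch below. Since my data isn't shareable, it uses the built-in lung data as a stand-in; the covariates and the TIME/EVENT columns are placeholders for the real ones.

```r
library(survival)

# Placeholder model on the built-in lung data (stands in for my
# coxph(Surv(TIME, EVENT) ~ ...) fit on the credit-card data).
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# One survival curve per client, then P(buy by day t) = 1 - S(t).
sf <- survfit(fit, newdata = lung)
p_buy <- 1 - t(summary(sf, times = c(30, 60, 90))$surv)
colnames(p_buy) <- c("d30", "d60", "d90")

# Rank clients by 30-day purchase probability, most likely first.
ranking <- order(p_buy[, "d30"], decreasing = TRUE)
head(ranking)
```

One caveat I've since realized: under proportional hazards, 1 - S(t) is monotone in the linear predictor for every t, so the 30-, 60-, and 90-day rankings are identical; only the time-dependent AUC estimates can differ across horizons.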
Is that even possible? What are the alternatives for reporting accuracy? I have checked http://dni-institute.in/blogs/cox-regression-interpret-result-and-predict/, but they seem to be doing the opposite of what I need.
NOTE: Survival analysis is new to me, but I'm familiar with general ML concepts such as cross-validation, overfitting, and so on. Thanks!
EDIT1: I've found the survAUC package, but I'm not sure if I understood the parameters:
train = get.data(is.train = TRUE)   # my own data-loading helper
test  = get.data(is.train = FALSE)
fit   = fit.surv()                  # fits the coxph model above
surv.train = Surv(train$TIME, train$EVENT)
surv.test  = Surv(test$TIME, test$EVENT)
lp     = predict(fit)        # linear predictor on the training data
lp.new = predict(fit, test)  # linear predictor on the test data
AUC.cd(surv.train, surv.test, lp, lp.new, c(30, 60, 90))
# returns 0.7270601 0.7272526 0.7274083
EDIT2: Another option, survConcordance in the survival package:
fit = fit.surv()
test = get.data(is.train=FALSE)
surv.test = Surv(test$TIME, test$EVENT)
survConcordance(surv.test ~ predict(fit, test), data = test)
# Output:
# n= 428
# Concordance= 0.7799616  se= 0.03275571
#  concordant  discordant   tied.risk   tied.time    std(c-d)
#    23533.00     6639.00        0.00      144.00     1976.61
I'm really not sure what these lines are doing; I'd appreciate any help on this!
Best Answer
This is not how you validate an event-time model. You need a smooth calibration curve at each of a series of time horizons, plus validation of predictive discrimination, e.g., Somers' Dxy rank correlation (the c-index). The R rms package makes this easy, and it can use the bootstrap to correct for overfitting if you are honest about including all candidate variables in the model. See my course notes for details: http://biostat.mc.vanderbilt.edu/rms
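A minimal sketch of that rms workflow, using the built-in lung data as a stand-in for your credit-card variables (the formula, B, and the 90-day horizon are illustrative, not a prescription):

```r
library(rms)

# datadist is standard rms setup for later summaries/plots.
dd <- datadist(lung); options(datadist = "dd")

# x=TRUE, y=TRUE, surv=TRUE are required by validate()/calibrate();
# time.inc must match the horizon passed to calibrate() as u.
f <- cph(Surv(time, status) ~ age + sex, data = lung,
         x = TRUE, y = TRUE, surv = TRUE, time.inc = 90)

# Bootstrap-corrected Somers' Dxy (c-index = Dxy / 2 + 0.5),
# calibration slope, and related indexes.
v <- validate(f, B = 100)

# Calibration at the 90-day horizon; cmethod = "KM" groups subjects
# and avoids the polspline dependency of the default "hare" smoother.
cal <- calibrate(f, u = 90, cmethod = "KM", m = 50, B = 100)
plot(cal)
```

Repeating calibrate() at each horizon of interest (here 30, 60, 90 days, each with a matching time.inc) gives the series of calibration curves described above.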