Survival Analysis – How to Determine Time at Which Covariates are Obtained

censoring, cox-model, survival

I'm interested in predicting when a customer will churn, after they've turned a certain age (in this case, age 18, but this could really be any number).

This piece is critical: given the domain being modeled, as someone gets older, their likelihood of churning should increase (this is domain-specific knowledge one would expect to surface in whatever model is produced).

However, so far my attempts at using a Cox proportional hazards model (following the approach outlined in the diagram below) have yielded the opposite: older people are less likely to churn (i.e., the covariate age has a negative coefficient, where churning is the "event").

I think the issue relates to the point at which I'm pulling the age covariate for each individual: for uncensored records (i.e., those who churned), I pulled their age as of the person's first appearance in the study; for censored records (i.e., those who didn't churn within the study period), my first approach had me pulling their ages as of each individual's last observation in the study period. To be frank, I'm brand-new to survival analysis and am not entirely sure why this was my first approach; for whatever reason, I was under the impression that all covariates should be pulled as of the most recent time period for censored individuals. However, I'm coming to doubt that understanding, given the unintuitive sign attached to my age coefficient.

So my question is: in the dataset/approach described above and in the diagram below, with a variety of potential censoring statuses, when should key covariates like age be pulled and attached to the dependent variable ("years until churn")? Should those green Xs for the right-censored records be shifted all the way to the left, instead of calculating age right before censoring/the end of the study?

[Diagram: example survival dataset]

Best Answer

The inconsistency in handling the age predictor between those who churned and those who didn't probably accounts for your unexpected modeled association between age and risk of churning. Altering any predictor based on whether or not there was an event recorded will get you into trouble in survival analysis.

A Cox model is fit based on the covariate values in place for all individuals at risk at each event time. So if you use a larger constant age value for someone who didn't churn than you would have used if she did churn, you are imposing something similar to survivorship bias on your model. In your case, you specified older ages for those who didn't churn than you should have, so it's not surprising that the model was fooled into thinking that a higher age is associated with less risk of churn.
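To make the mechanism concrete, here is a minimal sketch on made-up data (the six records and the helper `score_at_zero` are hypothetical, not taken from the question). It uses the fact that, for a single covariate, the score of the Cox partial likelihood at a coefficient of zero is the sum, over event times, of the event subject's covariate minus the mean covariate among everyone still at risk. Because the log partial likelihood is concave, the sign of that score indicates the sign of the fitted coefficient.

```python
# Toy illustration with hypothetical data: how inconsistent age coding can
# flip the sign of the age effect in a Cox model.

# (entry_age, years_on_study, churned) -- made-up records in which
# older entrants churn sooner.
records = [
    (30, 2, True),
    (28, 4, True),
    (26, 6, True),
    (27, 10, False),
    (25, 10, False),
    (23, 10, False),
]


def score_at_zero(rows):
    """Score of the one-covariate Cox partial likelihood at beta = 0.

    rows: list of (covariate, time, event). Each event contributes the
    event subject's covariate minus the mean covariate of the risk set
    (everyone whose follow-up time is at least the event time).
    Assumes no tied event times.
    """
    total = 0.0
    for x_i, t_i, event_i in rows:
        if not event_i:
            continue
        at_risk = [x for x, t, _ in rows if t >= t_i]
        total += x_i - sum(at_risk) / len(at_risk)
    return total


# Consistent coding: age at study entry for everyone.
consistent = [(age, t, ev) for age, t, ev in records]

# Inconsistent coding described in the question: entry age for churners,
# age at last observation (entry age + follow-up) for censored records.
inconsistent = [(age if ev else age + t, t, ev) for age, t, ev in records]

score_consistent = score_at_zero(consistent)
score_inconsistent = score_at_zero(inconsistent)
print("score with consistent coding:  ", score_consistent)
print("score with inconsistent coding:", score_inconsistent)
```

On this toy data the first score is positive and the second is negative: inflating the ages of the censored records alone is enough to reverse the apparent direction of the age effect.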

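One way to handle age as a predictor is to enter the value at study entry as a covariate for all individuals, whether or not they churned. In fact, if you code age that way and model age as a simple linear predictor of the log-hazard of churning, then the way Cox models are fit handles the changing age values over time directly: when time is measured from study entry, each person's current age is age at entry plus elapsed time, and the elapsed-time term is the same for everyone at risk at a given event time, so it cancels out of the partial likelihood. In that case, the coefficient you estimate applies equally to age at study entry and to current age. See Section 5 of the R vignette on time-dependent survival models for an explanation.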
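The equivalence for a linear age term can be checked numerically. The sketch below is an illustration on made-up records, assuming time is measured from study entry and there are no tied event times; `log_partial_likelihood`, `entry_age`, and `current_age` are hypothetical helper names. It evaluates the Cox log partial likelihood once with age at entry and once with current age (entry age plus elapsed time) and compares the two.

```python
import math

# (entry_age, years_on_study, churned) -- made-up records
records = [
    (30, 2, True),
    (28, 4, True),
    (26, 6, True),
    (27, 10, False),
    (25, 10, False),
    (23, 10, False),
]


def log_partial_likelihood(beta, rows, age_at):
    """Cox log partial likelihood for a single linear age term.

    age_at(entry_age, t) returns the age value used at follow-up time t,
    so the same function covers both fixed and time-varying age.
    """
    total = 0.0
    for age_i, t_i, event_i in rows:
        if not event_i:
            continue
        # Linear predictor of the subject who churns at time t_i.
        eta_i = beta * age_at(age_i, t_i)
        # Sum of exp(linear predictor) over everyone still at risk at t_i,
        # with each covariate evaluated at that same time.
        denom = sum(
            math.exp(beta * age_at(age_j, t_i))
            for age_j, t_j, _ in rows
            if t_j >= t_i
        )
        total += eta_i - math.log(denom)
    return total


def entry_age(age, t):
    return age  # fixed at study entry


def current_age(age, t):
    return age + t  # entry age plus elapsed follow-up time


for beta in (-0.2, 0.05, 0.3):
    ll_entry = log_partial_likelihood(beta, records, entry_age)
    ll_current = log_partial_likelihood(beta, records, current_age)
    print(beta, ll_entry, ll_current)
```

The two values agree (up to floating-point rounding) for every coefficient tried, so both codings lead to the same fitted coefficient. This holds only for a linear term; a spline or other nonlinear function of age would not cancel in the same way.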

If you want to model age more flexibly (e.g., with a regression spline), then you need to decide, based on your subject matter knowledge, whether you want to use age at study entry or current age as the predictor. For the former, just code the age predictor as the age at study entry.

For the latter, you need to structure the data in the extended "counting process" format and treat age as a time-varying covariate, with a separate row for each individual's time interval corresponding to each set of covariate values, including a start and stop time for the interval and an indicator of whether the event occurred at the stop time. The above vignette section explains how to do that.
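As a rough sketch of what that restructuring looks like, the hypothetical helper below (`to_counting_process` is an illustrative name, not a library function) splits one person's follow-up into intervals and records the current age at the start of each. The one-year interval width and the choice to hold age constant within an interval are assumptions made for illustration; finer intervals may be appropriate for a flexible age term.

```python
def to_counting_process(subject_id, entry_age, years_on_study, churned, step=1.0):
    """Split one subject's follow-up into (start, stop] intervals.

    Each row carries the age at the start of its interval, and the event
    indicator is 1 only on the final interval of a subject who churned.
    """
    rows = []
    start = 0.0
    while start < years_on_study:
        stop = min(start + step, years_on_study)
        rows.append({
            "id": subject_id,
            "start": start,
            "stop": stop,
            # The event can only occur at the end of the last interval.
            "event": int(churned and stop == years_on_study),
            # Time-varying covariate: current age at the interval start.
            "age": entry_age + start,
        })
        start = stop
    return rows


# A subject who entered at age 30 and churned after 2.5 years.
rows = to_counting_process("A", 30, 2.5, True)
for row in rows:
    print(row)
```

Rows in this shape match the (start, stop, event) layout that the R survival package accepts via `Surv(start, stop, event)`; the column names used here are arbitrary.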
