Survival Probability – Predicting Survival Probability or Time in Datasets Using Predictive Models

Tags: predictive-models, survival, time-varying-covariate

I have a longitudinal dataset of physicians with time-independent covariates (age group, physician type, etc.) and time-dependent covariates (number of patients, hours worked, etc.). Each physician has several entries, one per monthly interval, in which the time-dependent covariates change from month to month. The data are in counting-process format.

I want to predict the risk/survival of a physician leaving. I am unsure how to do this because:

COXPH can handle time-dependent covariates and models the effect of the variables on the hazard, but it can't predict survival beyond the observation window.

Classical machine learning can classify or regress a physician leaving but may not model the time aspect very well: it treats each row as an independent sample, with no link to the other rows for the same physician.

I am wondering whether the following approaches are sound:

  1. Would it be possible to extract the hazard from the coxph (or predict partial hazard) and keep this hazard constant until the survival probability reaches 0?

  2. Could I train a classical model such as xgboost or random forest to predict whether a physician is going to leave within x months, and include rolling averages of the time-dependent covariates to relate the different observations within a physician's group of observations?
    EDIT: The input to this classical ML model would be all my features (time-independent + time-dependent), and the output is whether the physician did not leave or left (0 or 1). This binary output is generated by checking term_date - study_start_date < num_days, where num_days could be 6 months. That is, if a person leaves at some point, we label that person's last 6 months as 1 (left), and this is what we are trying to predict. I propose to compute rolling averages for each of the time-varying covariates and add these averages as new features in the dataset.

  3. Another approach is to do the same as 2 but instead regress the term date. I can convert the term_date into days since beginning of study and regress this. I can then use this model to predict when they will leave and threshold on this. I think this would be a great feat but it may be difficult to do.
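A minimal sketch of the feature and label construction described in item 2, in plain Python. The field names (pid, month, n_patients), the 3-month trailing window, and the rule for rows whose outcome is not yet observable are illustrative assumptions, not part of the actual dataset.

```python
from collections import defaultdict

HORIZON = 6  # months ahead in which "left" counts as a positive label
WINDOW = 3   # trailing rows used for the rolling mean (assumed value)


def build_features(rows, term_month, study_end):
    """Add a trailing rolling mean and a binary 'leaves within HORIZON' label.

    rows: dicts with hypothetical keys 'pid', 'month', 'n_patients'.
    term_month: maps pid -> month the physician left (absent if still employed).
    Assumes one row per physician per month with no gaps, so a window of
    WINDOW rows equals WINDOW calendar months.
    """
    by_pid = defaultdict(list)
    for row in rows:
        by_pid[row["pid"]].append(row)

    out = []
    for pid, recs in by_pid.items():
        recs.sort(key=lambda r: r["month"])
        term = term_month.get(pid)
        for i, row in enumerate(recs):
            # Trailing window: uses only the current and earlier months.
            window = recs[max(0, i - WINDOW + 1): i + 1]
            roll = sum(w["n_patients"] for w in window) / len(window)
            if term is not None and term - row["month"] <= HORIZON:
                label = 1
            elif row["month"] + HORIZON <= study_end:
                label = 0
            else:
                label = None  # horizon runs past the study end: outcome unknown
            out.append({**row, "n_patients_roll": roll, "label": label})
    return out


# Toy data: physician A leaves at month 8, physician B is still employed.
rows = [
    {"pid": "A", "month": m, "n_patients": n}
    for m, n in [(1, 10), (2, 20), (3, 30), (4, 40)]
] + [
    {"pid": "B", "month": m, "n_patients": n}
    for m, n in [(8, 5), (9, 7), (10, 9)]
]
features = build_features(rows, term_month={"A": 8}, study_end=12)
for f in features:
    print(f)
```

Rows labelled None would have to be excluded from training (or handled with a survival method), since treating them as 0 assumes the physician stays.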

Essentially, I would like to predict either when a physician will leave (a survival curve or a regressed value is fine) or whether they will leave before a certain time interval, using data that include time-varying covariates. The above are some approaches I have thought about, but I am not sure they are correct. How can I accomplish this?

Best Answer

As Therneau and Grambsch put it (page 272):

Survival curves based on a time-dependent covariate must be used with extreme caution.

My answer here covers many of the general problems in working with models based on time-varying covariates and making predictions from them. The warnings about potentially violating causality and modeling a covariate that changes suddenly and represents the "last straw" in a decision to leave seem particularly applicable here.

With respect to your questions:

Would it be possible to extract the hazard from the coxph (or predict partial hazard) and keep this hazard constant [beyond the observed survival window] until the survival probability is 0?

In a fitted Cox model the estimated hazard is identically 0 except at event times observed in the original data. So, no. You could consider extrapolating to later times with a parametric survival model, but predictions beyond the range of the observations are always risky.
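As a rough illustration of what parametric extrapolation means, the simplest parametric survival model is the exponential one, which assumes a single constant hazard. The sketch below fits that hazard by maximum likelihood from right-censored data (events divided by total time at risk) and evaluates the survival curve past the last observed time. The data are made up, covariates are ignored, and the constant-hazard assumption is purely for illustration; a real analysis would need to check whether any parametric form fits.

```python
import math


def fit_constant_hazard(durations, events):
    """Exponential-model MLE with right censoring: events / total time at risk."""
    total_time = sum(durations)
    if total_time <= 0:
        raise ValueError("total time at risk must be positive")
    return sum(events) / total_time


def survival(rate, t):
    """S(t) = exp(-rate * t) under a constant hazard."""
    return math.exp(-rate * t)


# Hypothetical data: months observed, and 1 = left, 0 = still employed (censored).
durations = [12, 30, 24, 36, 8, 36]
events = [1, 1, 0, 0, 1, 0]

rate = fit_constant_hazard(durations, events)
for t in (12, 36, 60):  # 60 months lies beyond the 36-month observation window
    print(f"S({t}) = {survival(rate, t):.3f}")
```

The curve at 60 months is produced entirely by the model assumption, not by data, which is why such extrapolation calls for caution.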

Could I train a classical model such as xgboost or random forest to predict if a physician is going to leave before x months ...

Old fogies like me don't think of xgboost or random forests as "classical" models, but this could be done with standard binomial regressions, too. The trick would be how to incorporate the time-varying covariates properly into a model that works with a fixed time window. I'm not sure how to do that.

You could, however, put a short time window together with a survival model incorporating time-varying covariates, as you seem to be interested in relatively short-term decisions to leave. That would move away from the individual-specific time origin, implicit in the survival modeling of your first question, and go to a panel-data approach with a fixed study start date as time = 0 for all individuals then employed, as in your second question. You then include things like prior length of service at study start as covariates in a survival model restricted to a reasonably short period (say, for just a few years) after the common study start date. That would avoid the difficulties of trying to extrapolate to long individual survival times that led in part to your question.

My answer here outlines such an approach, based on a thesis that modeled customer churn in the insurance industry (pretty similar in concept to your situation). The modeling in that thesis allowed for time-varying covariates. What you would be modeling is something different from the individual-survival analysis often done in clinical studies but might be more directly applicable to your business needs.
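The fixed-start-date setup described above could be sketched as the following data transformation. The Physician record, its field names, and the 36-month window are hypothetical; time-varying covariates are left out for brevity, though they could be added as further start/stop rows per person.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Physician:
    pid: str
    hire_month: int            # months relative to study start (negative = hired earlier)
    term_month: Optional[int]  # months relative to study start; None if still employed


def to_study_time(physicians, window_months):
    """Re-express tenure on a common clock: time 0 is the study start for everyone."""
    rows = []
    for p in physicians:
        if p.hire_month > 0:
            continue  # not yet employed at study start
        if p.term_month is not None and p.term_month <= 0:
            continue  # already gone at study start
        left_in_window = p.term_month is not None and p.term_month <= window_months
        rows.append({
            "pid": p.pid,
            "prior_service": -p.hire_month,  # covariate: service before study start
            "duration": p.term_month if left_in_window else window_months,
            "event": int(left_in_window),    # 0 = censored at the end of the window
        })
    return rows


physicians = [
    Physician("A", -24, 10),    # leaves 10 months into the study
    Physician("B", -6, None),   # still employed
    Physician("C", 5, None),    # hired after study start: excluded
    Physician("D", -12, 40),    # leaves after the window: censored at 36
    Physician("E", -30, -2),    # left before study start: excluded
]
panel = to_study_time(physicians, window_months=36)
for row in panel:
    print(row)
```

The resulting duration and event columns are what a survival model would take as the outcome, with prior_service as an ordinary covariate.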

Another approach is to do the same as 2 but instead regress the term date.

The problem with this approach is that you omit the cases whose "term date" hasn't been reached. That introduces bias. Survival analysis is needed to help avoid that bias in regressions of times-to-events when some events, even if inevitable, haven't occurred during the study.
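A small simulation can make the bias concrete. It assumes, purely for illustration, exponentially distributed times to leaving with a true mean of 50 months and a 36-month follow-up; none of these numbers come from the question. The naive estimate averages only the departures that were observed, while the censoring-aware estimate (the exponential-model MLE) also counts the time contributed by people who had not left.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
TRUE_MEAN = 50.0        # assumed true mean time to leaving, in months
FOLLOW_UP = 36.0        # assumed length of the observation window

true_times = [rng.expovariate(1 / TRUE_MEAN) for _ in range(5000)]
observed = [min(t, FOLLOW_UP) for t in true_times]  # what the study records
left = [t <= FOLLOW_UP for t in true_times]         # False = censored

# Naive: regress/average only the cases whose term date was reached.
naive_mean = sum(t for t, e in zip(observed, left) if e) / sum(left)

# Censoring-aware: total time at risk divided by the number of events.
aware_mean = sum(observed) / sum(left)

print(f"true mean:            {TRUE_MEAN:.1f}")
print(f"naive mean:           {naive_mean:.1f}")
print(f"censoring-aware mean: {aware_mean:.1f}")
```

The naive mean can never exceed the follow-up length, however long physicians actually stay, because it only sees departures that happened inside the window.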

A final warning: using pre-Covid data on clinician turnover to model behavior either during the current pandemic or in a hoped-for thereafter seems to be pretty risky. Proceed "with extreme caution" on that account, too.