Survival Analysis Models – Using Uncensored Data for Time-to-Event Prediction

machine learning, survival, time series

Are there any advantages to using survival analysis models such as Cox’s proportional hazards model with uncensored data, compared with simple linear regression or other classic ML models? I have data with recurrent events and am trying to predict the time of the next event. The data contain about 2000 different subjects and about 60 events per subject. The percentage of censored data (the last event of each subject) is small and I don't think it plays a big role in the prediction.

Best Answer

It depends on how inter-event times are associated with covariate values. If you think that the inter-event times can be linearly related to appropriately modeled covariates, then a linear regression would be fine. There is a method due to Buckley and James that can incorporate right censoring into linear regression.

If you think that the relative hazard of an event among individuals at any time is related to their covariate values but that the baseline hazard the individuals share isn't constant and has an unknown form, then a semi-parametric Cox model could be useful. For such a model it's only the order of events in time that matters, not the absolute time values. That can be an advantage if you don't know the form of the underlying hazard.
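To illustrate the point that only the ordering of event times enters a Cox fit, here is a minimal sketch of the partial log-likelihood for a single covariate, assuming no tied event times. The function name `cox_partial_loglik` and the toy data are made up for illustration; this is not a replacement for a fitted model from a survival library.

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """Cox partial log-likelihood for one covariate (assumes no tied event times)."""
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored records contribute only through risk sets
        # Risk set: everyone whose observed time is at least t_i
        risk = [math.exp(beta * x[j]) for j, t_j in enumerate(times) if t_j >= t_i]
        ll += beta * x[i] - math.log(sum(risk))
    return ll

# Hypothetical toy data: durations, event indicators (1 = event, 0 = censored), covariate
times = [2.0, 5.0, 3.0, 8.0, 1.5]
events = [1, 1, 0, 1, 1]
x = [0.5, -1.0, 0.2, 1.3, 0.0]

ll_raw = cox_partial_loglik(0.7, times, events, x)
# A monotone transformation of time keeps the ordering, hence the risk sets
ll_log = cox_partial_loglik(0.7, [math.log(t) for t in times], events, x)
```

Because taking logs preserves the order of the times, `ll_raw` and `ll_log` coincide, so the estimated coefficient would be the same on either time scale.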

If you have specific parametric forms for baseline hazards and associations of covariates with outcomes, then it isn't that much harder to use all the information you have by allowing for right censoring of inter-event times in a parametric survival model. It might not make a big practical difference in your case with many events per individual, provided that your assumption "I don't think it plays a big role in the prediction" holds. If you allow for censoring, however, then you could test your assumption.
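One way to check whether censoring matters is to compare an estimate that accounts for it with one that drops the censored gaps. The sketch below assumes the simplest possible parametric form, a constant hazard (exponential model) with no covariates, where the maximum-likelihood rate is the number of observed events divided by the total time at risk. The function names and numbers are hypothetical.

```python
def exp_rate_with_censoring(durations, observed):
    """Exponential MLE: observed events divided by total time at risk."""
    return sum(observed) / sum(durations)

def exp_rate_dropping_censored(durations, observed):
    """Naive estimate that discards censored gaps entirely."""
    complete = [t for t, o in zip(durations, observed) if o]
    return len(complete) / sum(complete)

# Hypothetical inter-event gaps for one subject; the last gap is right censored
durations = [1.0, 2.0, 3.0, 4.0]
observed = [1, 1, 1, 0]

rate_cens = exp_rate_with_censoring(durations, observed)      # 3 / 10
rate_naive = exp_rate_dropping_censored(durations, observed)  # 3 / 6
```

With only four gaps the two estimates differ a lot; with about 60 gaps per subject and one censored gap each, the difference should be much smaller, which is the assumption this comparison lets you test.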

It isn't clear from your question how you are defining time = 0 for the first event. If there's a well-defined entry time for each individual and no prior events, fine. If events are happening all along and you happen to enroll an individual into the study between events, then that first inter-event time is also right censored.
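A rough sketch of how inter-event gaps might be laid out with censoring flags, following the distinction above. The helper `gap_records` and its `entry_is_origin` argument are hypothetical names, and the handling of the first gap simply follows the treatment described in this answer.

```python
def gap_records(entry, event_times, end, entry_is_origin=True):
    """Return (gap, observed) pairs for one subject.

    entry_is_origin=False means the subject was enrolled between events,
    so the first gap is flagged as right censored.
    """
    records = []
    prev = entry
    for k, t in enumerate(event_times):
        observed = k > 0 or entry_is_origin
        records.append((t - prev, observed))
        prev = t
    if end > prev:
        # Time from the last event to end of follow-up is right censored
        records.append((end - prev, False))
    return records

well_defined = gap_records(0.0, [2.0, 5.0, 9.0], 10.0)
mid_stream = gap_records(0.0, [2.0, 5.0, 9.0], 10.0, entry_is_origin=False)
```

Records in this form (a duration plus an event indicator) are what most survival-modeling routines expect as input.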

Finally, as you are presumably taking the correlations within individuals into account in some way (e.g., random effects, or a "frailty" in a survival model), don't forget that any predictions from your model will be for some "typical" individual having the specified set of covariate values. The usefulness of such predictions will depend on the variance of the random effects relative to what's explained by the covariates.