Survival Analysis Forecasting – Predicting Future Survival Probabilities

forecastingrsurvivaltime series

I've used time-series forecasting (ETS, ARIMA, etc.) models but because of the depth of my data I'm exploring using the survival and related packages in R to forecast future period survival probabilities. For example, if we have 24 months of survival data, I'm interested in forecasting months 25-36 using the statistical parameters derived from analyzing months 1-24, using either Kaplan-Meier or a parametric distribution like Weibull. In the form of simulation with multiple simulation paths derived, as I've done with traditional time-series methods. I'm exploring alternatives to the traditional time-series methods because they seem overly-simplistic in light of depth of data I have (similar to what you see in the lung and kidney datasets that are part of the survival package). Please, does anyone know if this type of probabilistic future period forecasting is possible in survival analysis, and know of any simple examples on-line that I can start with?

Best Answer

One obvious solution, as you already alluded to is to use a parametric model (such as Weibull regression). Non-parametric/semi-parametric models like Kaplan-Meier or Cox do in their standard implementation not extrapolate beyond the largest observed event time. There's other implementations that more or less do the same thing, but can extrapolate, e.g. an exponential model with piecewise constant hazard rates can be very similar to Cox regression, but will extrapolate if you just assume the hazard rate of the last interval going forward (similarly, e.g. the brms R package implements a kind of Cox model using M-splines for the baseline hazard function which I think should permit extrapolation).

E.g. if you use R, one obvious option is to penalize Weibull regression fit <- survreg(Surv(time, has_event) ~ 1 + ridge( predictor1, predictor2, ... , theta=1, scale=T), dist = "weibull", data=mydata), where the ridge-penalty theta could be picked using cross-validation. After fitting, the traditional shape parameter is given by 1/fit$scale and the scale parameter for a record predict(object=fit, newdata=newdata, type="lp"), so that you get an estimated survival curve by plotting the Weibull CDF with these parameters, but that ignores uncertainty about the parameters. This gets a lot easier to deal with in the Bayesian version of Weibull regression, which is nicely supported in the brms R package (if you want instead of individual patient time predictions to predict the parameters of the distribution, there's some nuances: see here).

Alternatively, a more machine learning oriented approach would be to e.g. use XGBoost with an accelerated failure time loss, but calibration tends to be poor. However, there's alternatives that try to fix this.

Related Question