R Survival Analysis – How to Fit Parametric Distribution to Baseline Hazard in Cox PH Models

biostatisticscox-modeleconometricsrsurvival

I develop Cox PH model with time-varying covariates to predict marginal probability of some event. The time frequency is measured in months and I have observations since 3 months to more than 160 months. However, number of observations is too low for later months (more than 130 months) and also events are very rare, if any (you can see survfit plot of my model on the figure below). Due to this fact, baseline hazard curve is too uneven especially for later months. So, I assume, it will be better to smooth baseline and fitting some parametric distribution to it. To extract baseline curve from Cox PH model I use hasehaz function in R. Then I can fit to it parametric distribution (Weibull, lognormal, etc.), however, I have not idea how to link this fitted curve to survival probability predicted by Cox model. I predict it with following code:
predict(cox_model, new_data, type = "survival").
I appreciate any suggestion regarding to merging fitted baseline curve with Cox model prediction.

p.s. I have tried AFT models from flexsurv packages, however, unfortunately, model did not converge. Survfit plot of my model

Best Answer

Several thoughts:

First, even if you could generate a smooth baseline survival curve beyond 130 months, it would be unreliable. That's implicit in the large confidence bands around the survival curve at those times.

Second, the plateau in survival beyond 130 months is at about 85% survival! Furthermore, there are many censoring times along that plateau. It seems that you might need to model this as a "cure" model instead, in which there is "infinite" survival for a substantial fraction of cases. My initial reaction is that your data suggest no hazard of an event beyond about 130 months.

Third, your inability to fit simple accelerated failure time models might be due to the second thought. Such models assume that survival approaches zero at late times. Your data don't.

Fourth, the proportional hazard assumption makes it relatively straightforward to adjust a baseline hazard for covariate values. See this page among many others. For your approach to a smooth baseline curve, you could fit the estimated cumulative hazard $\hat H_0(t)$ for a defined baseline set of covariate values,* adjust the cumulative hazard for differences of covariate values from baseline, then use the relationship $S(t)=- \exp (H(t))$ to get the survival curve.

Fifth, it's possible to fit an arbitrary smooth baseline hazard as part of the modeling process. For example, the flexsurv package allows for the Royston-Parmar flexible spline modeling of baseline hazards; the scale = "hazard" parameter setting does that in a proportional hazards setting. I suspect that's a better approach than yours of first fitting a Cox model and then smoothing the baseline hazard, but I have no experience with it.

Sixth, even after you've dealt with those issues, you still have the fundamental problem of trying to make predictions based on time-varying covariates. This page and its links show the potentially circular reasoning that can be involved; such predictions might not make sense in some settings.

Seventh, none of this gets beyond the problems raised in the first thought.


*The baseline hazard reported for survival models is sometimes for a set of "average" values that might make no sense in practice when there are categorical covariates. Best to choose your own or to make sure that the software makes interpretable choices.

Related Question