Are there any advantages to using survival analysis models, like Cox's proportional hazards model, with uncensored data over simple linear regression or other classic ML models? I have data with recurrent events and I am trying to predict the time of the next event. The data contain about 2,000 subjects with about 60 events per subject. The percentage of censored observations (the last event of each subject) is small, and I don't think it plays a big role in the prediction.
Survival Analysis Models – Using Uncensored Data for Time-to-Event Prediction
machine learning · survival · time series
Related Solutions
It is not clear to me how you generate your event times (which, in your case, might be $<0$) and event indicators:
time = rnorm(n,10,2)
S_prob = S(time)
event = ifelse(runif(1)>S_prob,1,0)
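# note: runif(1) is a single uniform draw, recycled against the whole S_prob vector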
So here is a generic method, followed by some R code.
Generating survival times to simulate Cox proportional hazards models
To generate event times from the proportional hazards model, we can use the inverse probability method (Bender et al., 2005): if $V$ is uniform on $(0, 1)$ and if $S(\cdot \,|\, \mathbf{x})$ is the conditional survival function derived from the proportional hazards model, i.e. $$ S(t \,|\, \mathbf{x}) = \exp \left( -H_0(t) \exp(\mathbf{x}^\prime \mathbf{\beta}) \vphantom{\Big(} \right) $$ then the random variable $$ T = S^{-1}(V \,|\, \mathbf{x}) = H_0^{-1} \left( - \frac{\log(V)}{\exp(\mathbf{x}^\prime \mathbf{\beta})} \right) $$ has survival function $S(\cdot \,|\, \mathbf{x})$. This result is known as "the inverse probability integral transformation". Therefore, to generate a survival time $T \sim S(\cdot \,|\, \mathbf{x})$ given the covariate vector, it suffices to draw $v$ from $V \sim \mathrm{U}(0, 1)$ and to make the inverse transformation $t = S^{-1}(v \,|\, \mathbf{x})$.
Example [Weibull baseline hazard]
Let $h_0(t) = \lambda \rho t^{\rho - 1}$ with shape $\rho > 0$ and scale $\lambda > 0$. Then $H_0(t) = \lambda t^\rho$ and $H^{-1}_0(t) = (\frac{t}{\lambda})^{\frac{1}{\rho}}$. Following the inverse probability method, a realisation of $T \sim S(\cdot \,|\, \mathbf{x})$ is obtained by computing $$ t = \left( - \frac{\log(v)}{\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})} \right)^{\frac{1}{\rho}} $$ with $v$ a uniform variate on $(0, 1)$. Using results on transformations of random variables, one may notice that $T$ has a conditional Weibull distribution (given $\mathbf{x}$) with shape $\rho$ and scale $\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})$.
R code
The following R function generates a data set with a single binary covariate $x$ (e.g. a treatment indicator). The baseline hazard has a Weibull form. Censoring times are randomly drawn from an exponential distribution.
# baseline hazard: Weibull
# N = sample size
# lambda = scale parameter in h0()
# rho = shape parameter in h0()
# beta = fixed effect parameter
# rateC = rate parameter of the exponential distribution of C
simulWeib <- function(N, lambda, rho, beta, rateC)
{
  # covariate --> N Bernoulli trials
  x <- sample(x=c(0, 1), size=N, replace=TRUE, prob=c(0.5, 0.5))

  # Weibull latent event times
  v <- runif(n=N)
  Tlat <- (- log(v) / (lambda * exp(x * beta)))^(1 / rho)

  # censoring times
  C <- rexp(n=N, rate=rateC)

  # follow-up times and event indicators
  time <- pmin(Tlat, C)
  status <- as.numeric(Tlat <= C)

  # data set
  data.frame(id=1:N,
             time=time,
             status=status,
             x=x)
}
Test
Here is a quick simulation with $\beta = -0.6$ (coxph comes from the survival package):
library(survival)

set.seed(1234)
betaHat <- rep(NA, 1e3)
for(k in 1:1e3)
{
  dat <- simulWeib(N=100, lambda=0.01, rho=1, beta=-0.6, rateC=0.001)
  fit <- coxph(Surv(time, status) ~ x, data=dat)
  betaHat[k] <- fit$coef
}
> mean(betaHat)
[1] -0.6085473
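As a quick follow-up sketch (rateC = 0.05 here is an illustrative value, not part of the original test), the same simulation with much heavier censoring shows whether the estimate still recovers $\beta$ when censoring does play a role:

set.seed(1234)
betaHatC <- rep(NA, 1e3)
for(k in 1:1e3)
{
  # same setup, but the exponential censoring now removes a large share of events
  dat <- simulWeib(N=100, lambda=0.01, rho=1, beta=-0.6, rateC=0.05)
  betaHatC[k] <- coef(coxph(Surv(time, status) ~ x, data=dat))
}
mean(betaHatC)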
I suggest looking into models for recurrent events with a dependent terminal event. The terminal event (time to termination of the contract) acts as dependent censoring for the recurrent-event process, so the usual assumption of "independent censoring" does not hold.
The basic idea in tackling this is to use random effects (frailty). With this, you would have an intensity (hazard) for the recurrent events which depends on the random effect $z_i$: $$ h_i(t|z_i) = z_i h_0(t) \exp(\beta'x_i) $$ and a hazard for the terminal event $$ \lambda_i(t|z_i) = z_i^\alpha \lambda_0(t) \exp(\gamma'x_i) $$ with $\beta$ and $\gamma$ regression coefficients. When $\alpha<0$, a high rate of recurrent events is associated with a lower rate of terminal events; when $\alpha>0$, it is associated with a higher rate of terminal events; and $\alpha=0$ means the two processes are linked only through the covariates.
This is implemented in the R package frailtypack, available on CRAN, where you can specify $h_0$ and $\lambda_0$ in several forms (splines, constant, Weibull, etc.). This might give you a starting point.
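As a minimal sketch, assuming the package's bundled readmission data (recurrent hospital readmissions with death as the terminal event); the covariates and the kappa smoothing values below are illustrative, not tuned:

library(frailtypack)
data(readmission)

# joint frailty model: spline baselines for the recurrent-event intensity h0
# and the terminal-event hazard lambda0, linked through the frailty z_i^alpha
jointFit <- frailtyPenal(
  Surv(t.start, t.stop, event) ~ cluster(id) + sex + dukes + terminal(death),
  formula.terminalEvent = ~ sex + dukes,
  data = readmission,
  n.knots = 10, kappa = c(1e10, 1e10),  # illustrative smoothing values
  recurrentAG = TRUE                    # calendar (Andersen-Gill) timescale
)
print(jointFit)  # printed output includes the estimated alpha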
For reading, a good starting point would be Section 6.6 of The Statistical Analysis of Recurrent Events by Cook and Lawless. Good luck!
Best Answer
It depends on how inter-event times are associated with covariate values. If you think that the inter-event times can be linearly related to appropriately modeled covariates, then a linear regression would be fine. There is a method due to Buckley and James that can incorporate right censoring into linear regression.
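For illustration, one implementation is bj() in the rms package; the data frame dat and its columns (gap = inter-event time, status = event indicator, x1 = covariate) are made-up names for this sketch:

library(survival)
library(rms)

# toy gap-time data (illustrative only)
set.seed(1)
dat <- data.frame(gap = rexp(500, 0.1), status = rbinom(500, 1, 0.9), x1 = rnorm(500))

# Buckley-James: linear regression (by default on log time) that
# incorporates the right-censored observations
bjFit <- bj(Surv(gap, status) ~ x1, data = dat)
bjFit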
If you think that the relative hazard of an event among individuals at any time is related to their covariate values but that the baseline hazard the individuals share isn't constant and has an unknown form, then a semi-parametric Cox model could be useful. For such a model it's only the order of events in time that matters, not the absolute time values. That can be an advantage if you don't know the form of the underlying hazard.
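As a sketch of that approach for gap times (again with made-up column names; cluster(id) gives robust standard errors for the within-subject correlation):

library(survival)

# toy recurrent-event data: 10 gap times per subject (illustrative only)
set.seed(2)
dat <- data.frame(id = rep(1:50, each = 10),
                  gap = rexp(500, 0.1),
                  status = rbinom(500, 1, 0.9),
                  x1 = rnorm(500))

# semi-parametric Cox model on the gap timescale: the baseline hazard is
# left unspecified, so only the ordering of the gap times matters
fit <- coxph(Surv(gap, status) ~ x1 + cluster(id), data = dat)
summary(fit)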
If you have specific parametric forms for baseline hazards and associations of covariates with outcomes, then it isn't that much harder to use all the information you have by allowing for right censoring of inter-event times in a parametric survival model. It might not make a big practical difference in your case with many events per individual, provided that your assumption "I don't think it plays a big role in the prediction" holds. If you allow for censoring, however, then you could test your assumption.
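A sketch of the parametric analogue, assuming a Weibull form and the same made-up columns; survreg keeps the status = 0 rows in the likelihood as right-censored observations rather than discarding them:

library(survival)

# toy gap-time data (illustrative only)
set.seed(3)
dat <- data.frame(gap = rexp(500, 0.1), status = rbinom(500, 1, 0.9), x1 = rnorm(500))

# Weibull accelerated failure time model with right censoring
pfit <- survreg(Surv(gap, status) ~ x1, data = dat, dist = "weibull")
summary(pfit)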
It isn't clear from your question how you are defining time = 0 for the first event. If there's a well-defined entry time for each individual and no prior events, fine. If events are happening all along and you happen to enroll an individual into the study between events, then that first inter-event time is also right censored.

Finally, as you are presumably taking the correlations within individuals into account in some way (e.g., random effects, or a "frailty" in a survival model), don't forget that any predictions from your model will be for some "typical" individual having the specified set of covariate values. The usefulness of such predictions will depend on the variance of the random effects relative to what's explained by the covariates.