Solved – Comparing different methods of discrete-time survival analysis

cox-model, logistic, mixed-model, survival

I'm investigating a discrete time survival problem (the units are months and exit times range from month 1 to 36). From looking around so far, it seems like there are a few different types of model that I could apply:

A Cox proportional hazards model with "exact" tie resolution, a.k.a. a conditional logistic regression with the strata being the set of subjects alive at each month. This would not automatically give me an estimate of the baseline hazard, but I understand that I could recover one later.
A standard logistic regression with one data point per subject-month, with time represented as a categorical variable (edit: or, as Alexis points out, I could use some functional form as well). This amounts to a proportional-odds model, or proportional hazards if I use a cloglog link.
A mixed-effects model, like the above, but treating time as a random effect rather than a dummy-coded categorical variable.

I'm interested in predicting the entire survival function for all of my data points, not just understanding the direction and magnitude of covariate effects. I have on the order of 100k subjects and 100 covariates, so I can easily afford the extra 35 parameters for a dummy variable/mixed-effects model.

It seems to me that I should expect these models all to output similar results. In general, when should I prefer one over the other? (Or are there other types of models that I'm missing?)

EDIT: I've preliminarily tried fitting some of them in R and have run into various random segfaults/stack-overflows in the exact Cox model and computational difficulties with a previous mixed-effects model. So I may end up going simply with whichever one doesn't explode on my data! Still, I'd appreciate other considerations.

Best Answer

A Cox proportional hazards model with "exact" tie resolution, a.k.a a conditional logistic regression ...

A standard logistic regression with one data point per subject-month, with time represented as a categorical variable

A conditional logistic regression model and a standard binary regression model with the logistic link and a categorical variable for time are the same thing. You end up with 36 different intercept terms, or 1 intercept and 35 dummy coefficients, or something similar, depending on how you set up the dummy coding.
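The dummy-coded setup can be sketched on toy data as follows. This is a minimal illustration, not the asker's actual model: the data, the 6-month horizon, and the names `subj`, `pp`, `fit_logit` and `fit_cloglog` are all assumptions made for the example.

```r
# Hypothetical toy data: one row per subject (all names and values are illustrative)
set.seed(1)
n <- 500
subj <- data.frame(id = seq_len(n), x = rbinom(n, 1, 0.5))

# Discrete exit month drawn from a constant monthly hazard,
# with administrative censoring at month 6
haz <- plogis(-2 + 0.5 * subj$x)
t_event <- rgeom(n, haz) + 1            # month of the event: 1, 2, ...
subj$month  <- pmin(t_event, 6)
subj$status <- as.numeric(t_event <= 6)

# Expand to one row per subject-month at risk
pp <- subj[rep(seq_len(n), subj$month), c("id", "x")]
pp$period <- sequence(subj$month)
# event = 1 only in the last row of a subject who actually had the event
pp$event <- as.numeric(pp$period == rep(subj$month, subj$month) &
                       rep(subj$status, subj$month) == 1)

# Time as a factor: one intercept plus (number of periods - 1) dummy coefficients
fit_logit   <- glm(event ~ factor(period) + x, family = binomial("logit"),   data = pp)
fit_cloglog <- glm(event ~ factor(period) + x, family = binomial("cloglog"), data = pp)

coef(fit_logit)["x"]     # log odds ratio (proportional odds)
coef(fit_cloglog)["x"]   # log hazard ratio (grouped proportional hazards)
```

With 36 months instead of 6, the same formula would yield the 1 intercept and 35 dummy coefficients mentioned above.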

It seems to me that I should expect these models all to output similar results. In general, when should I prefer one over the other? (Or are there other types of models that I'm missing?)

It depends on what you want to achieve. If your goal is to say something about the hazard ratio between two observations, or about odds ratios, then the dummy coding approach may be preferable, as you make no assumptions about the intercept. You do, however, need to check the assumption implied by the link function you use (e.g., is the proportional hazards assumption justified in the Cox model?).

However, you cannot make predictions about the probability of survival in future periods in a model with 36 dummies, as you have no model for the intercept in future periods. This is not the case with a random effects model, where you have a model for the intercept, or with a parametric model for the intercept. You do, though, need to justify your assumptions about the distribution of the random effects or the parametric model you have chosen.
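To make the prediction point concrete, here is a small sketch of how per-period hazards turn into a survival curve. The hazard values are made up for illustration; in practice they might come from something like `predict(fit, newdata, type = "response")` on person-period rows.

```r
# Hypothetical fitted discrete hazards for one subject in months 1..4
# (illustrative numbers only)
hazard <- c(0.10, 0.08, 0.05, 0.05)

# S(t) = prod_{j <= t} (1 - h_j): probability of surviving through month t
surv <- cumprod(1 - hazard)

# P(exit in month t) = h_t * S(t - 1), with S(0) = 1
p_exit <- hazard * c(1, head(surv, -1))

data.frame(month = seq_along(hazard), hazard, surv, p_exit)
```

The curve stops at the last month for which a hazard is available. With one dummy per observed month there is no fitted hazard for a later month, which is why extending the curve requires a parametric or random-effects model for the intercept.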

EDIT: I've preliminarily tried fitting some of them in R and have run into various random segfaults/stack-overflows ...

You can also check out the ddhazard function in my package dynamichazard. You can use it to fit a discrete-time survival model with a random walk for the intercept and/or the coefficients.

Related Solutions

Solved – Survival analysis: continuous vs discrete time

The choice of the survival model should be guided by the underlying phenomenon. In this case it appears to be continuous, even if the data are collected in a somewhat discrete manner. A resolution of one month would be just fine over a 5-year period. However, the large number of ties at 6 and 12 months makes one wonder whether you really have 1-month precision (the ties at 0 are expected: that is a special value at which a relatively large number of deaths actually happen). I am not quite sure what you can do about that, as it most likely reflects after-the-fact rounding rather than interval censoring.

Survival Analysis – How to Create a Toy Survival (Time to Event) Data with Right Censoring

It is not clear to me how you generate your event times (which, in your case, might be $<0$) and event indicators:

time = rnorm(n,10,2) 
S_prob = S(time)
event = ifelse(runif(1)>S_prob,1,0)

So here is a generic method, followed by some R code.

Generating survival times to simulate Cox proportional hazards models

To generate event times from the proportional hazards model, we can use the inverse probability method (Bender et al., 2005): if $V$ is uniform on $(0, 1)$ and if $S(\cdot \,|\, \mathbf{x})$ is the conditional survival function derived from the proportional hazards model, i.e. $$ S(t \,|\, \mathbf{x}) = \exp \left( -H_0(t) \exp(\mathbf{x}^\prime \mathbf{\beta}) \vphantom{\Big(} \right) $$ then it is a fact that the random variable $$ T = S^{-1}(V \,|\, \mathbf{x}) = H_0^{-1} \left( - \frac{\log(V)}{\exp(\mathbf{x}^\prime \mathbf{\beta})} \right) $$ has survival function $S(\cdot \,|\, \mathbf{x})$. This result is known as "the inverse probability integral transformation". Therefore, to generate a survival time $T \sim S(\cdot \,|\, \mathbf{x})$ given the covariate vector, it suffices to draw $v$ from $V \sim \mathrm{U}(0, 1)$ and to make the inverse transformation $t = S^{-1}(v \,|\, \mathbf{x})$.
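A one-line justification of this fact, assuming $S(\cdot \,|\, \mathbf{x})$ is continuous and strictly decreasing (so that $S^{-1}$ exists and reverses inequalities): $$ P(T > t \,|\, \mathbf{x}) = P\left( S^{-1}(V \,|\, \mathbf{x}) > t \right) = P\left( V < S(t \,|\, \mathbf{x}) \right) = S(t \,|\, \mathbf{x}), $$ where the last equality holds because $V$ is uniform on $(0, 1)$ and $S(t \,|\, \mathbf{x}) \in [0, 1]$.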

Example [Weibull baseline hazard]

Let $h_0(t) = \lambda \rho t^{\rho - 1}$ with shape $\rho > 0$ and scale $\lambda > 0$. Then $H_0(t) = \lambda t^\rho$ and $H^{-1}_0(t) = (\frac{t}{\lambda})^{\frac{1}{\rho}}$. Following the inverse probability method, a realisation of $T \sim S(\cdot \,|\, \mathbf{x})$ is obtained by computing $$ t = \left( - \frac{\log(v)}{\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})} \right)^{\frac{1}{\rho}} $$ with $v$ a uniform variate on $(0, 1)$. Using results on transformations of random variables, one may notice that $T$ has a conditional Weibull distribution (given $\mathbf{x}$) with shape $\rho$ and scale $\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})$.
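As a quick sanity check of the Weibull example, the sketch below compares simulated times with the theoretical survival function. The parameter values are illustrative assumptions. It also shows that, in the parameterisation used by R's `pweibull` (survival function $\exp(-(t/b)^a)$), the same distribution has shape $a = \rho$ and scale $b = (\lambda \exp(\mathbf{x}^\prime \mathbf{\beta}))^{-1/\rho}$.

```r
set.seed(42)
lambda <- 0.01; rho <- 1.5; beta <- -0.6; x <- 1   # illustrative values only

# Inverse probability method for the Weibull baseline hazard
v <- runif(1e5)
t_sim <- (-log(v) / (lambda * exp(x * beta)))^(1 / rho)

# Theoretical survival function S(t | x) = exp(-lambda * t^rho * exp(x * beta))
S_theory <- function(t) exp(-lambda * t^rho * exp(x * beta))

# Empirical versus theoretical survival at a few time points
t_grid <- c(5, 10, 20, 40)
S_emp <- sapply(t_grid, function(t) mean(t_sim > t))
round(cbind(t = t_grid, empirical = S_emp, theory = S_theory(t_grid)), 3)

# Same distribution expressed in R's shape/scale parameterisation
b <- (lambda * exp(x * beta))^(-1 / rho)
S_weib <- pweibull(t_grid, shape = rho, scale = b, lower.tail = FALSE)
max(abs(S_weib - S_theory(t_grid)))
```

The empirical and theoretical columns should agree up to Monte Carlo error, and the last line should be zero up to floating-point rounding.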

R code

The following R function generates a data set with a single binary covariate $x$ (e.g. a treatment indicator). The baseline hazard has a Weibull form. Censoring times are randomly drawn from an exponential distribution.

# baseline hazard: Weibull

# N = sample size    
# lambda = scale parameter in h0()
# rho = shape parameter in h0()
# beta = fixed effect parameter
# rateC = rate parameter of the exponential distribution of C

simulWeib <- function(N, lambda, rho, beta, rateC)
{
  # covariate --> N Bernoulli trials
  x <- sample(x=c(0, 1), size=N, replace=TRUE, prob=c(0.5, 0.5))

  # Weibull latent event times
  v <- runif(n=N)
  Tlat <- (- log(v) / (lambda * exp(x * beta)))^(1 / rho)

  # censoring times
  C <- rexp(n=N, rate=rateC)

  # follow-up times and event indicators
  time <- pmin(Tlat, C)
  status <- as.numeric(Tlat <= C)

  # data set
  data.frame(id=1:N,
             time=time,
             status=status,
             x=x)
}

Test

Here is a quick simulation with $\beta = -0.6$:

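library(survival)  # provides coxph() and Surv()

set.seed(1234)
betaHat <- rep(NA, 1e3)
for(k in 1:1e3)
{
  # simulate one data set and store the estimated log hazard ratio for x
  dat <- simulWeib(N=100, lambda=0.01, rho=1, beta=-0.6, rateC=0.001)
  fit <- coxph(Surv(time, status) ~ x, data=dat)
  betaHat[k] <- coef(fit)
}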

> mean(betaHat)
[1] -0.6085473
