Solved – Comparing different methods of discrete-time survival analysis

cox-model, logistic, mixed model, survival

I'm investigating a discrete time survival problem (the units are months and exit times range from month 1 to 36). From looking around so far, it seems like there are a few different types of model that I could apply:

  • A Cox proportional hazards model with "exact" tie resolution, a.k.a. a conditional logistic regression with the strata being the risk sets, i.e. the subjects still alive at each month. This would not automatically give me an estimate of the baseline hazard, but I understand that I could recover one later.

  • A standard logistic regression with one data point per subject-month, with time represented as a categorical variable (edit: or, as Alexis points out, I could use some functional form of time as well). This amounts to a proportional-odds model, or a proportional-hazards model if I use a cloglog link.

  • A mixed-effects model: like the above, but considering time as a random effect rather than a dummified categorical variable.
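The "one data point per subject-month" layout used by the second and third options can be sketched as follows. This is a minimal illustration in plain Python, not the asker's actual pipeline; the function name `to_person_period`, the tuple layout `(id, exit_month, event)`, and the toy data are all assumptions made for the example.

```python
def to_person_period(subjects):
    """Expand (id, exit_month, event) records into one row per subject-month.

    event = 1 means the subject exited in exit_month; event = 0 means the
    subject was censored after being observed through exit_month.
    """
    rows = []
    for subject_id, exit_month, event in subjects:
        for month in range(1, exit_month + 1):
            # The binary response is 1 only in the month where the exit occurs.
            y = 1 if (event == 1 and month == exit_month) else 0
            rows.append({"id": subject_id, "month": month, "y": y})
    return rows


# Hypothetical toy data: "a" exits in month 3, "b" is censored at month 2.
toy = [("a", 3, 1), ("b", 2, 0)]
person_period = to_person_period(toy)
for row in person_period:
    print(row)
```

A binary regression of `y` on the covariates plus a categorical `month` term, fitted to rows like these, is the discrete-time hazard model described in the second bullet.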

I'm interested in predicting the entire survival function for all of my data points, not just understanding the direction and magnitude of covariate effects. I have on the order of 100k subjects and 100 covariates, so I can easily afford the extra 35 parameters for a dummy variable/mixed-effects model.
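For reference, once a discrete-time hazard model has been fitted, the survival function for a subject follows from the per-month hazards as S(t) = (1 - h(1)) × ... × (1 - h(t)). The sketch below assumes a logit link; the intercepts and linear predictor are made-up numbers, and `survival_curve` is a hypothetical helper, not part of any library.

```python
import math


def sigmoid(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def survival_curve(intercepts, linear_predictor):
    """Return S(1), ..., S(T) for one subject under a logit-link discrete hazard.

    intercepts[t - 1] is the period-specific intercept for month t and
    linear_predictor is the subject's x'beta (both hypothetical inputs here).
    """
    surv = []
    s = 1.0
    for alpha_t in intercepts:
        # h(t) = P(exit in month t | still at risk at the start of month t)
        hazard = sigmoid(alpha_t + linear_predictor)
        # S(t) = S(t - 1) * (1 - h(t))
        s *= 1.0 - hazard
        surv.append(s)
    return surv


# Made-up values for illustration only.
alphas = [-3.0, -2.5, -2.0]
curve = survival_curve(alphas, linear_predictor=0.4)
print(curve)
```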

It seems to me that I should expect these models all to output similar results. In general, when should I prefer one over the other? (Or are there other types of models that I'm missing?)

EDIT: I've preliminarily tried fitting some of them in R and have run into various random segfaults/stack-overflows in the exact Cox model and computational difficulties with a previous mixed-effects model. So I may end up going simply with whichever one doesn't explode on my data! Still, I'd appreciate other considerations.

Best Answer

A Cox proportional hazards model with "exact" tie resolution, a.k.a. a conditional logistic regression ...

A standard logistic regression with one data point per subject-month, with time represented as a categorical variable

A conditional logistic regression model and a standard binary regression model with the logistic link and a categorical variable for time are the same thing. You end up with 36 different intercept terms, or 1 intercept and 35 dummy coefficients, or something similar, depending on how you set up the dummy coding.
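Written out, the two intercept parameterisations mentioned above look like this. The notation ($h_i(t)$ for the hazard of subject $i$ in month $t$, $x_i$ for covariates, $\alpha_t$ and $\gamma_t$ for the time terms) is introduced here for illustration, and treatment coding with month 1 as the reference level is an assumption.

```latex
% 36 separate intercepts, one per month:
\operatorname{logit} h_i(t) = \alpha_t + x_i^\top \beta, \qquad t = 1, \dots, 36

% Equivalent form with 1 intercept and 35 dummy coefficients:
\operatorname{logit} h_i(t) = \alpha_1 + \gamma_t + x_i^\top \beta, \qquad \gamma_1 = 0, \quad \gamma_t = \alpha_t - \alpha_1
```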

It seems to me that I should expect these models all to output similar results. In general, when should I prefer one over the other? (Or are there other types of models that I'm missing?)

It depends on what you want to achieve. If your goal is to say something about hazard ratios or odds ratios between two observations, then the dummy coding approach may be preferable, as you make no assumptions about the intercept. You do, however, need to check the assumption implied by the link function you use (e.g., is the proportional hazards assumption justified in the Cox model?).
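What each link function assumes can be checked numerically. In the sketch below (made-up intercepts and a made-up covariate shift `delta`), the logit link keeps the odds ratio of the per-period hazard constant across periods, while the cloglog link keeps the ratio of -log(1 - h) constant, which is the proportional hazards property for grouped continuous-time data.

```python
import math


def hazard_logit(eta):
    """Discrete hazard under the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def hazard_cloglog(eta):
    """Discrete hazard under the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))


alphas = [-3.0, -2.0, -1.0]  # made-up period intercepts
delta = 0.5                  # made-up difference in x'beta between two subjects

odds_ratios = []
cumhaz_ratios = []
for a in alphas:
    h0, h1 = hazard_logit(a), hazard_logit(a + delta)
    odds_ratios.append((h1 / (1.0 - h1)) / (h0 / (1.0 - h0)))

    g0, g1 = hazard_cloglog(a), hazard_cloglog(a + delta)
    # -log(1 - h) is the cumulative hazard accrued within the period.
    cumhaz_ratios.append(math.log(1.0 - g1) / math.log(1.0 - g0))

print(odds_ratios)    # constant across periods under the logit link
print(cumhaz_ratios)  # constant across periods under the cloglog link
```

Both ratios equal exp(delta) in every period, so each link imposes its own proportionality assumption that should be checked against the data.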

However, a model with 36 dummies cannot predict the probability of survival in periods beyond those observed, because it has no model for the intercept in future periods. This is not the case with a random-effects model or a parametric model for the intercept, both of which specify how the intercept behaves over time. You do, though, need to justify your assumptions about the distribution of the random effects or the parametric form you have chosen.

EDIT: I've preliminarily tried fitting some of them in R and have run into various random segfaults/stack-overflows ...

You can also check out the ddhazard function in my package dynamichazard. You can use it to fit a discrete-time survival model with a random walk for the intercept and/or the coefficients.
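As a rough illustration of what "a random walk for the intercept" means, one common formulation is a first-order Gaussian random walk. The notation and the Gaussian assumption below are introduced here for illustration and are not necessarily the exact specification used by the package.

```latex
\operatorname{logit} h_i(t) = \alpha_t + x_i^\top \beta, \qquad
\alpha_t = \alpha_{t-1} + \varepsilon_t, \qquad
\varepsilon_t \sim \mathcal{N}(0, \sigma^2)

% Implied forecast for a period k steps beyond the last observed month T:
\mathbb{E}[\alpha_{T+k} \mid \alpha_T] = \alpha_T, \qquad
\operatorname{Var}[\alpha_{T+k} \mid \alpha_T] = k \sigma^2
```

Under this assumption the intercept has a defined distribution in future periods, which a set of 36 free dummy coefficients does not provide.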
