regression – Why Discrete Time Survival Models Are Needed

I have been told that one of the main advantages of Semi-Parametric Models (e.g. Survival Analysis, Cox Proportional Hazards Model) is that these models do not require assuming the response variable (e.g. Survival Times) to follow a specific probability distribution, thus allowing for a higher level of flexibility.

This got me thinking: suppose the survival times are discrete. As an example, suppose all we know is how many years each patient remained in the study.

Initially, I had thought that if we chose a Semi-Parametric Survival Model (e.g. Cox PH), the presence of discrete survival times should not be a problem. As an example, perhaps the true underlying distribution of survival times follows some discrete probability distribution such as the Poisson Distribution – but since a Semi-Parametric approach (according to my likely flawed understanding) does not require you to specify a probability distribution, in theory it should not matter whether the true underlying distribution is continuous or discrete.

However, when reading more about this topic (e.g. https://grodri.github.io/glms/notes/c7s6, https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-022-01679-6), it appears that there have been different models specifically developed for this task (i.e. discrete survival times).

This slightly confuses me – if one of the main advantages of Semi-Parametric Models is their ability to avoid using a specific probability distribution, how come you can't just use a common Semi-Parametric Model such as Cox PH when the survival times are discrete?

The only thing that comes to mind is "philosophical reasons". Perhaps the Cox PH model was specifically designed for continuous survival times – that is, the advantages of the Cox PH model come into effect only if the true probability distribution of survival times is continuous. Perhaps the Cox PH model was fundamentally not intended to be used with discrete data. Thus, in the case of discrete data, you would want to choose a probability distribution that is defined in a "discrete sense" – that is, the random variable cannot take negative or non-integer values. Although when faced with discrete data you might be able to "trick" the computer into using a continuous distribution and proceed with a Cox PH model, perhaps some problems relating to inference and interpretation may later arise (e.g. negative hazards and survival probabilities, but I am not entirely sure about this).

Can someone please comment on this?

Best Answer

There are at least two contributing reasons. First, as a number of commenters have indicated, handling tied times in the Cox partial likelihood is inconvenient. Second, there are different ways to end up with discrete survival data, and these have motivated additional research.

So, first: ties. The Cox partial likelihood for continuous data is based on time-point probabilities $P(i\text{ died}\mid\text{one person died})$. These are nice and easy; the baseline hazard cancels out so we get $$L_i= \frac{\lambda_0(t_i)\exp(z_i\beta)}{\sum_{j\text{ alive}} \lambda_0(t_i)\exp(z_j\beta)}=\frac{\exp(z_i\beta)}{\sum_{j\text{ alive}} \exp(z_j\beta)}$$ For discrete times, the natural extension is $P(\text{exactly these }k\text{ people died}\mid k\text{ people died})$, whose denominator sums over every size-$k$ subset of the people at risk, which is exponentially many (in $k$) terms. If $k$ is really large -- say, 10 -- this is no fun at all. There are approximations: the Breslow approximation (Norm Breslow didn't actually like this one) and the Efron approximation. It's not so simple.
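To make the tie problem concrete, here is a minimal Python sketch of one event time's contribution under the exact discrete partial likelihood and under the Breslow and Efron approximations. The function names, the covariate values `z`, and `beta` are made up for illustration; `risk[i]` stands for $\exp(z_i\beta)$ for each person still at risk.

```python
from itertools import combinations
from math import comb, exp, prod

def exact_term(risk, died):
    """P(exactly `died` fail | len(died) of the risk set fail)."""
    k = len(died)
    num = prod(risk[i] for i in died)
    # Denominator: one product per size-k subset of the risk set.
    den = sum(prod(risk[j] for j in subset)
              for subset in combinations(range(len(risk)), k))
    return num / den

def breslow_term(risk, died):
    """Breslow: reuse the full risk-set sum for each of the k tied deaths."""
    k = len(died)
    num = prod(risk[i] for i in died)
    return num / sum(risk) ** k

def efron_term(risk, died):
    """Efron: remove a fraction l/k of the tied deaths' risk at step l."""
    k = len(died)
    num = prod(risk[i] for i in died)
    total = sum(risk)
    tied = sum(risk[i] for i in died)
    den = prod(total - (l / k) * tied for l in range(k))
    return num / den

# Hypothetical data: five people at risk, persons 0 and 3 die at the same time.
z = [0.2, -1.0, 0.5, 1.3, 0.0]
beta = 0.8
risk = [exp(beta * zi) for zi in z]
died = [0, 3]

for name, term in [("exact", exact_term),
                   ("Breslow", breslow_term),
                   ("Efron", efron_term)]:
    print(name, term(risk, died))

# Number of subsets in the exact denominator for 10 ties among 100 at risk.
print(comb(100, 10))
```

The three values are not on the same absolute scale (they differ by factors that do not involve $\beta$ when all risks are equal), so they should be compared as functions of $\beta$ rather than as numbers. With a single death all three reduce to the usual Cox term. The last line shows why the exact version gets expensive: with 10 ties among 100 people at risk the denominator has on the order of $10^{13}$ terms.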

Second: why do we have discrete data? It could be interval-censored data: we test people's blood glucose at times $t_k$ and diagnose diabetes if it's too high. It could be grouped data: we round survival times to the nearest day. It could be that the event can only happen at discrete times (cheating on an exam? Sleeping through your alarm?). These lead to different models: if you have proportional hazards in continuous time but you have interval-censored measurement it's not the same as having proportional hazards in discrete time.
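A small numerical check can illustrate that last distinction. The sketch below assumes continuous-time proportional hazards observed only in intervals, so the probability of an event within an interval, given survival to its start, is $1-\exp(-\Delta H_0 e^{z\beta})$, where $\Delta H_0$ is the baseline cumulative hazard over the interval. The interval values and `beta` are invented for illustration.

```python
import math

def grouped_hazard(cum_hazard_increment, hazard_ratio):
    """Interval event probability implied by continuous-time PH."""
    return 1.0 - math.exp(-cum_hazard_increment * hazard_ratio)

def cloglog(p):
    return math.log(-math.log(1.0 - p))

def logit(p):
    return math.log(p / (1.0 - p))

beta = 0.7                       # hypothetical log hazard ratio for z = 1 vs z = 0
increments = [0.05, 0.5, 2.0]    # hypothetical baseline cumulative hazard per interval

cloglog_diffs = []
logit_diffs = []
for d_h in increments:
    h0 = grouped_hazard(d_h, 1.0)             # z = 0
    h1 = grouped_hazard(d_h, math.exp(beta))  # z = 1
    cloglog_diffs.append(cloglog(h1) - cloglog(h0))
    logit_diffs.append(logit(h1) - logit(h0))

print("cloglog differences:", cloglog_diffs)  # constant, equal to beta
print("logit differences:  ", logit_diffs)    # change with the interval
```

On the complementary log-log scale the covariate effect is the same $\beta$ in every interval, whereas on the logit scale it drifts as the intervals get wider or the baseline hazard gets larger. So a discrete-time model with a logit link (proportional odds on the discrete hazard) and grouped continuous-time proportional hazards are different models, even though they nearly agree when the interval hazards are small.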

So, using the Cox model for discrete-time data doesn't end up making life easier either computationally or in interpretation. It can be worth the effort of using explicit discrete-time models. On the other hand, if your measurement times are the same for everyone and reasonably frequent, pretending that the events happen at the measurement points and that you have proportional hazards actually works pretty well and is fairly common in cohort studies. This paper looks at what happens when measurement times aren't similar.
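For the explicit discrete-time route, one common setup is to expand the data into one row per person per period at risk and then fit a binary-response model to the event indicator. The sketch below shows only the expansion step; the function name and the `(subject_id, last_period, event)` record layout are assumptions made for illustration.

```python
def person_period(records):
    """Expand (subject_id, last_period, event) into person-period rows.

    Periods are numbered 1..last_period. `event` is truthy if the subject
    had the event in `last_period`, falsy if censored at its end.
    Returns a list of (subject_id, period, event_in_period) tuples.
    """
    rows = []
    for subject_id, last_period, event in records:
        for period in range(1, last_period + 1):
            event_in_period = int(bool(event) and period == last_period)
            rows.append((subject_id, period, event_in_period))
    return rows

# Hypothetical data: "a" has the event in year 3, "b" is censored after year 2.
rows = person_period([("a", 3, True), ("b", 2, False)])
for row in rows:
    print(row)
```

Each row can then be treated as a Bernoulli observation with period as a factor (playing the role of the baseline hazard) plus the covariates. A complementary log-log link corresponds to grouped continuous-time proportional hazards, and a logit link to a proportional-odds model for the discrete hazard.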