Discrete Survival – How to Correctly Write Survival Function in Discrete Time Survival Analysis

discrete data, hazard, survival

I have seen several ways to write (and calculate and interpret) the survivor function in discrete-time survival analysis. I wonder which one is correct, or, if both are, whether the interpretation or setup of the problem differs in a way I am missing.

Here is an example.

  • Customers open an account at a bank.
  • The month of opening and the month of closing are recorded. The discrete random variable T takes values in {0, 1, ..., 23}; after 23 there is right censoring. T denotes the number of month boundaries crossed between open and close. For example:
      • A customer who opens in January 2012 and closes in January 2012 is labeled 0.
      • A customer who opens in January 2012 and closes in February 2012 is labeled 1.

An interval is of the form [a,a+1) where a=0,..,A.

The following hazards, h(t), are calculated.

For example, h(0) = 0.019194: a customer opens and closes in the same month with probability 0.019194.

Question:
What is the proper way to construct the survivor function?

  1. Some sources define S(t) as the probability that the event occurs AFTER period t: S(t) = Pr(T>t). This is the yellow column below. It is calculated as $S(t) = \prod_{j=0}^{t}(1-h(j))$.
  2. Others set S(0) = 1 for the first period and continue from there. This seems to say S(t) = Pr(T>=t). It is calculated as $S(t) = \prod_{j=1}^{t}(1-h(j-1))$, where the empty product at t = 0 gives S(0) = 1.

In the continuous case, I guess the choice between Pr(T>t) and Pr(T>=t) doesn't matter, but in the discrete case it does.
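To make the two conventions concrete, here is a minimal Python sketch that builds both columns from a list of hazards. Only h(0) = 0.019194 comes from the example above; the other hazard values and the variable names are made up for illustration.

```python
from itertools import accumulate
from operator import mul

# h[0] is the value quoted in the question; h[1:] are hypothetical values.
h = [0.019194, 0.03, 0.025, 0.02]

# Convention 1: S(t) = Pr(T > t) = prod_{j=0}^{t} (1 - h(j))
surv_after = list(accumulate((1 - hj for hj in h), mul))

# Convention 2: S(t) = Pr(T >= t) = prod_{j=1}^{t} (1 - h(j-1)), so S(0) = 1
surv_at_or_after = [1.0] + surv_after[:-1]

for t, (hj, s1, s2) in enumerate(zip(h, surv_after, surv_at_or_after)):
    print(f"t={t}  h={hj:.6f}  Pr(T>t)={s1:.6f}  Pr(T>=t)={s2:.6f}")
```

Under these definitions the second column is simply the first column shifted down by one period, with 1 in the first row.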

[Image: table of the hazards h(t) and the two survivor-function calculations, with the Pr(T>t) version in the yellow column]

Best Answer

The answer is that both are used, unfortunately. In the continuous case, you are right that the distinction is unimportant. In the discrete case, the interpretation is slightly different, so it is important to state clearly which definition is being used.

In my experience, the most common definition of the survival function is $S(t) = \Pr(T>t)$, which matches your yellow column. This is the definition used in the derivation of the Kaplan-Meier estimator: $\hat{S}(t) = \frac{\text{individuals with } T>t}{\text{total individuals}} = \prod_{j=1}^k{\left(1-\frac{d_j}{r_j}\right)}$, where the product runs over the $k$ intervals up to and including $t$, $d_j$ is the number of events in interval $j$, $r_j$ is the number of individuals at risk in interval $j$, and $\frac{d_j}{r_j}$ is the estimated hazard for interval $j$. (The first equality holds when there is no censoring before $t$.)
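As a rough check of that identity, the sketch below compares the product formula with the plain fraction of individuals with T > t. It assumes fully observed (uncensored) data, and the closing months are invented for illustration; the function names are hypothetical, and intervals are indexed from 0 to match the bank example rather than from 1.

```python
# Hypothetical closing months for 10 accounts, none censored (made-up data).
times = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3]

def km_survival(times, t):
    """Product-limit estimate of Pr(T > t) for fully observed discrete times."""
    s = 1.0
    for j in range(t + 1):
        r_j = sum(1 for x in times if x >= j)  # at risk entering interval j
        d_j = sum(1 for x in times if x == j)  # events in interval j
        if r_j > 0:
            s *= 1 - d_j / r_j                 # multiply by (1 - hazard in j)
    return s

def fraction_surviving(times, t):
    """Share of individuals whose event time is strictly after t."""
    return sum(1 for x in times if x > t) / len(times)

for t in range(4):
    print(t, round(km_survival(times, t), 6), fraction_surviving(times, t))
```

With censored observations the two quantities would no longer agree, which is the situation the product form is designed to handle.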

An important note is that the survival function should start with $S(0) = 1$ if 0 is the first time point, in the absence of left censoring (i.e. assuming no one starts follow-up already having had the event). In the case of your example, I presume that someone who opens and closes a bank account in January 2012 opens the account before they close it; so if time intervals were shortened (for example using weeks or days as the time scale) then S(0) would equal 1 in both cases.

How much the distinction between the two definitions matters may depend on the specific application. The degree of divergence between the two calculations will likely depend on the length of follow-up, the frequency of the event, and the number of ties and how these are considered.
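One way to see the size of the gap, assuming the usual discrete-time hazard definition $h(t) = \Pr(T = t \mid T \ge t)$: the two versions differ by exactly the probability mass at $t$,

$$\Pr(T \ge t) - \Pr(T > t) = \Pr(T = t) = h(t)\,\Pr(T \ge t),$$

so $\Pr(T > t) = (1 - h(t))\,\Pr(T \ge t)$. With the hazard from the question, $h(0) = 0.019194$, this gives $\Pr(T \ge 0) = 1$ and $\Pr(T > 0) = 0.980806$. The gap is therefore small whenever the per-period hazard is small.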

In addition, in many applications we are interested in comparing hazards or survival between two groups rather than the absolute survival or hazard in a specific group. In this case, I think the distinction should be even less important, but I would have to check into that to be sure.

For more detail on survival analysis where $S(t)$ is clearly defined as $Pr(T>t)$, see: Allison: Survival Analysis Using the SAS System

For more detail, with $S(t) = \Pr(T \ge t)$, see: Collett: Modelling Survival Data in Medical Research, 2nd ed. (Note: most of the analytic details will be the same as in Allison, but the interpretations may differ.)
