Cumulative Hazard Function – Intuition in Survival Analysis

Tags: hazard, probability, survival

I'm trying to get intuition for each of the main functions in actuarial science (specifically for the Cox Proportional Hazards Model). Here's what I have so far:

  • $f(x)$: the probability density of the time of death, measured from the start time.
  • $F(x)$: just the cumulative distribution. At time $x$, what % of the population will be dead?
  • $S(x)$: $1-F(x)$. At time $x$, what % of the population will be alive?
  • $h(x)$: hazard function. At a given time $x$, of the people still alive, this can be used to estimate how many will die in the next time interval, or, as the interval shrinks to 0, the 'instantaneous' death rate.
  • $H(x)$: cumulative hazard. No idea.

What's the idea behind combining hazard values, especially when they are continuous? If we use a discrete example with death rates across four seasons, and the hazard function is as follows:

  • Starting at Spring, everyone is alive, and 20% will die
  • Now in Summer, of those remaining, 50% will die
  • Now in Fall, of those remaining, 75% will die
  • Final season is Winter. Of those remaining, 100% will die

Then the cumulative hazard is 20%, 70%, 145%, 245%?? What does that mean, and why is this useful?

Best Answer

Adding up the proportions dying, as you do, does not give you the cumulative hazard. The hazard rate in continuous time is the conditional probability that the event happens during a very short interval, given that it has not happened yet, divided by the length of that interval:

$$h(t) = \lim_{\Delta t \rightarrow 0} \frac {P(t<T \le t + \Delta t | T >t)} {\Delta t}$$

The cumulative hazard is the integral of the (instantaneous) hazard rate over age or time. It is like summing up probabilities, but since $\Delta t$ is very small, each of these probabilities is also a small number (e.g. the hazard rate of dying may be around 0.004 at ages around 30). Because the hazard rate is conditional on not having experienced the event before $t$, the accumulated value is not a probability, and over a whole population it can exceed 1.
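
For reference, these are the standard identities linking the functions listed in the question (written with $u$ as a dummy integration variable, which is my own notation):

$$H(t) = \int_0^t h(u)\,du, \qquad h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t), \qquad S(t) = \exp\left(-H(t)\right)$$

The last identity is a large part of why $H(t)$ is useful: survival can be recovered from it directly. In the Cox model the covariates multiply $h(t)$, and hence $H(t)$, by a constant factor.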

You can look up a human mortality life table (although this is a discrete-time formulation) and try accumulating the age-specific death rates $m_x$.
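
The same bookkeeping can be tried on the four-season example from the question. This is only a minimal sketch in R, treating the seasonal percentages as discrete conditional death probabilities; the variable names are my own.

```r
# Conditional probability of dying in each season, given alive at its start
q <- c(Spring = 0.20, Summer = 0.50, Fall = 0.75, Winter = 1.00)

# Share still alive at the END of each season: product of (1 - q)
S_end <- cumprod(1 - q)

# Share alive at the START of each season (everyone is alive in Spring)
S_start <- c(1, head(S_end, -1))

# Share of the ORIGINAL population dying in each season; sums to 1
f <- q * S_start

# Running sum of the conditional probabilities (the question's 20%, 70%, ...)
H_sum <- cumsum(q)

# Continuous-time style cumulative hazard: -log of survival
H_log <- -log(S_end)

print(round(cbind(q, f, S_end, H_sum, H_log), 3))
```

The running sum 0.2, 0.7, 1.45, 2.45 is what the question computed. Survival comes from the product of the $(1 - q)$ terms rather than from that sum, and $-\log S$ is the quantity that plays the role of $H$ in continuous time (it becomes infinite once everyone has died).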

If you use R, here's a little example of approximating these functions from the number of deaths in each 1-year age interval:

dx <-  c(3184L, 268L, 145L, 81L, 64L, 81L, 101L, 50L, 72L, 76L, 50L, 
         62L, 65L, 95L, 86L, 120L, 86L, 110L, 144L, 147L, 206L, 244L, 
         175L, 227L, 182L, 227L, 205L, 196L, 202L, 154L, 218L, 279L, 193L, 
         223L, 227L, 300L, 226L, 256L, 259L, 282L, 303L, 373L, 412L, 297L, 
         436L, 402L, 356L, 485L, 495L, 597L, 645L, 535L, 646L, 851L, 689L, 
         823L, 927L, 878L, 1036L, 1070L, 971L, 1225L, 1298L, 1539L, 1544L, 
         1673L, 1700L, 1909L, 2253L, 2388L, 2578L, 2353L, 2824L, 2909L, 
         2994L, 2970L, 2929L, 3401L, 3267L, 3411L, 3532L, 3090L, 3163L, 
         3060L, 2870L, 2650L, 2405L, 2143L, 1872L, 1601L, 1340L, 1095L, 
         872L, 677L, 512L, 376L, 268L, 186L, 125L, 81L, 51L, 31L, 18L, 
         11L, 6L, 3L, 2L)

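x <- 0:(length(dx) - 1)                    # age at the start of each 1-year interval

fx <- dx / sum(dx)                         # share of all deaths occurring at each age
Sx <- 1 - c(0, cumsum(fx)[-length(fx)])    # share still alive at the START of each age
hx <- fx / Sx                              # discrete hazard: deaths at age x among those alive at x
Hx <- cumsum(hx)                           # accumulated hazard

# Using survivors at the start of the interval avoids dividing by zero at the last age
plot(x, hx, type = "l", xlab = "age", ylab = "h(t)", main = "h(t)", log = "y")
plot(x, Hx, type = "l", xlab = "age", ylab = "H(t)", main = "H(t)")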
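
As a rough check of the link between cumulative hazard and survival in continuous time, here is a sketch that integrates a hazard numerically and compares $\exp(-H(t))$ with a known survival function. The Weibull hazard and its parameter values are hypothetical, chosen only for illustration.

```r
# Hypothetical Weibull hazard (shape and scale are illustrative assumptions)
shape <- 1.5
scale <- 10
h <- function(t) (shape / scale) * (t / scale)^(shape - 1)

t_grid <- c(1, 5, 10, 20)

# Cumulative hazard: numerical integral of h from 0 to t
H_num <- sapply(t_grid, function(t) integrate(h, lower = 0, upper = t)$value)

# Closed form for the Weibull: H(t) = (t / scale)^shape
H_exact <- (t_grid / scale)^shape

# Survival recovered from H, next to R's built-in Weibull survival function
S_from_H  <- exp(-H_num)
S_builtin <- pweibull(t_grid, shape = shape, scale = scale, lower.tail = FALSE)

print(cbind(t = t_grid, H_num, H_exact, S_from_H, S_builtin))
```

Note that $H(t)$ passes 1 at $t = 10$ here while the survival probability is still well above zero, which is another way to see that cumulative hazard is not a probability.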

Hope this helps.
