Survival Analysis – Why Random Survival Forest Uses Cumulative Hazard Function to Calculate C Index


I am working with Random Survival Forests, and the documentation says that it uses the sum of the cumulative hazard function to define a worse predicted outcome, which is then used to compute the C-index.

However, the C-index can also be computed easily from predicted survival probabilities, which a Random Survival Forest produces as well.

Why is the cumulative hazard function used instead of the survival function for computing the C-index?
The cumulative hazard function is hard to interpret, and I still don't see how the sum of the ensemble cumulative hazard function over all time points defines a worse predicted outcome.

Any help would be really appreciated!

Best Answer

The C-index is the fraction of pairs of comparable cases in which the observed event times are in the same order as the predicted. If survival curves can cross over time, then the order based on "predicted survival probabilities" can change depending on the choice of time point to calculate the probability. Here's an example of survival curves for 2 groups, one having a standard exponential and one a Weibull with the same median survival (shape, 2; scale, $\sqrt {\log 2}$ in the standard parameterization).

If you chose predicted survival probability at Time = 0.4 for comparisons, the order of "predicted" survival between these curves would be opposite from what you would get at Time = 1. Yet with many observations in each group you would find many observed survival times in both groups spanning a range out through Time = 1 or farther. Some type of "predicted outcome" based on a range of survival times, rather than a single time, might be more generalizable.
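The reversal can be checked numerically. This is a minimal sketch that assumes only the two distributions described above (standard exponential; Weibull with shape 2 and scale $\sqrt{\log 2}$); the function names are made up for illustration.

```python
import math

SCALE = math.sqrt(math.log(2))  # Weibull scale from the text; shape is 2


def surv_exponential(t):
    """Survival function of a standard (rate 1) exponential."""
    return math.exp(-t)


def surv_weibull(t, shape=2.0, scale=SCALE):
    """Survival function of a Weibull in the standard parameterization."""
    return math.exp(-((t / scale) ** shape))


# Compare the two groups before, at, and after the common median log(2).
for t in (0.4, math.log(2), 1.0):
    print(f"t={t:.3f}  exponential={surv_exponential(t):.3f}  "
          f"weibull={surv_weibull(t):.3f}")
```

The Weibull group has the higher survival probability at Time = 0.4 and the lower one at Time = 1, so a ranking based on a single time point depends on which time you pick.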

The cumulative hazard function $H(t)$ has a simple, direct relationship to survival $S(t)$ over (continuous) time: $S(t) = \exp(-H(t))$.* Although many struggle to develop an intuitive sense of the cumulative hazard, it's a very useful summary of the risk that's been accumulated up through time $t$. The out-of-bag (OOB) estimates of cumulative hazard for a case thus provide a reasonable "predicted outcome" value up through that time. The cumulative hazards over time for the two groups above are:
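For the two example groups, the cumulative hazards follow directly from $H(t) = -\log S(t)$. This is a small worked step under the parameterization stated earlier (standard exponential; Weibull with shape 2 and scale $\sqrt{\log 2}$), not something read off the original plot:

$$H_{\text{exp}}(t) = t, \qquad H_{\text{Weibull}}(t) = \left(\frac{t}{\sqrt{\log 2}}\right)^{2} = \frac{t^{2}}{\log 2}$$

The two are equal at $t = \log 2 \approx 0.69$, the common median. Before that time the exponential group has accumulated more hazard; after it, the Weibull group has.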

Choosing a specific time to compare cumulative hazards leads to the same problem as choosing a specific time to compare survival probabilities. I suspect that those who developed survival random forests considered something like integrating cumulative hazard over a time span of interest to be a better estimate of the "predicted outcome." That's still arbitrary, but perhaps a better choice when there are many groups with different predicted survival curves to compare overall. In this example, if the time span of interest extended out to a bit over Time = 1, the "predicted outcomes" for these 2 groups would then be the same, which makes some sense for groups having the same median survival observed to some reasonable time beyond (but not too far beyond) the median survival.
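To make the "a bit over Time = 1" statement concrete, here is a rough sketch that integrates the two cumulative hazards up to a horizon `tau`. It assumes the closed forms $H(t) = t$ for the exponential group and $H(t) = t^2/\log 2$ for the Weibull group, which follow from $S(t) = \exp(-H(t))$. It uses a continuous integral, whereas the software sums over observed event times, so treat it as an analogue rather than the package's calculation.

```python
import math

LOG2 = math.log(2)


def integrated_chf_exponential(tau):
    """Integral of H(t) = t from 0 to tau."""
    return tau ** 2 / 2


def integrated_chf_weibull(tau):
    """Integral of H(t) = t**2 / log(2) from 0 to tau."""
    return tau ** 3 / (3 * LOG2)


# Horizon at which the two integrated cumulative hazards coincide:
# tau**2 / 2 == tau**3 / (3 * log 2)  =>  tau == 1.5 * log(2)
tau_equal = 1.5 * LOG2

for tau in (0.5, tau_equal, 2.0):
    print(f"tau={tau:.3f}  exponential={integrated_chf_exponential(tau):.3f}  "
          f"weibull={integrated_chf_weibull(tau):.3f}")
```

With a shorter horizon the exponential group gets the larger (worse) summary, with a longer one the Weibull group does, and the two match at `tau = 1.5 * log(2)`, roughly 1.04.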

In response to a comment, there are two different sums involved here. At any particular time, you sum (and then average) the OOB estimates of cumulative hazard for each case over the trees in which that case was out-of-bag, to get the ensemble estimate for that case at that time.

For the summary over time to get the "predicted outcome" for each case, the software you cite then adds up those estimates over all event times in the data set. Given the bootstrapping while growing the random forest, that would seem to provide a good estimate of the situation in the underlying population. The original paper, however, notes (page 847) that other choices for the summation over time are possible.
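The second sum, and how it feeds into the C-index, can be sketched as follows. This is an illustration only, not the package's implementation: `predicted_outcome` and `concordance_index` are hypothetical helper names, the data are made up, and pairs with tied observed times are simply skipped.

```python
def predicted_outcome(chf_by_time):
    """Sum one case's ensemble cumulative hazard over the event-time grid."""
    return sum(chf_by_time)


def concordance_index(times, events, scores):
    """Harrell-style C: a higher score should go with an earlier event."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when case i had an event strictly
            # before case j's observed (event or censoring) time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied predictions count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical ensemble CHF values for 4 cases at 3 event times.
chf = [
    [0.9, 1.6, 2.4],  # case 0
    [0.2, 0.5, 0.9],  # case 1
    [0.4, 0.7, 1.0],  # case 2
    [0.1, 0.3, 0.6],  # case 3
]
times = [1.0, 2.0, 3.0, 4.0]
events = [True, True, False, True]  # case 2 is censored

scores = [predicted_outcome(row) for row in chf]
c_index = concordance_index(times, events, scores)
print(scores, c_index)
```

In this toy data there are 5 comparable pairs. Case 1 fails before case 2 is censored but has the smaller summed cumulative hazard, so that one pair is discordant.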

*As the documentation you linked discusses, this doesn't hold for the ensemble estimates based on averaging over all trees, even if it does within each tree. But the general principle is important: the cumulative hazard is a type of representation of (lack of) survival over time.
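A quick numerical illustration of why the relationship breaks down after averaging: the exponential of an average is not the average of the exponentials. The per-tree values below are hypothetical and chosen only to make the gap visible.

```python
import math

# Hypothetical cumulative hazards at one time point from three trees.
tree_chf = [0.2, 1.0, 3.0]

# Ensemble cumulative hazard: average of the per-tree values.
mean_chf = sum(tree_chf) / len(tree_chf)

# Ensemble survival: average of the per-tree survival probabilities,
# where S = exp(-H) does hold within each tree.
mean_survival = sum(math.exp(-h) for h in tree_chf) / len(tree_chf)

print(f"exp(-mean CHF)    = {math.exp(-mean_chf):.3f}")
print(f"mean of exp(-CHF) = {mean_survival:.3f}")
```

Because $\exp(-x)$ is convex, the averaged survival is at least as large as $\exp(-\text{averaged cumulative hazard})$, with equality only when all trees agree.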

Related Solutions

Cumulative Hazard Function – Intuition in Survival Analysis

Combining proportions dying as you do does not give you the cumulative hazard. The hazard rate in continuous time is the conditional probability, per unit time, that the event happens during a very short interval, given that it has not happened before:

$$h(t) = \lim_{\Delta t \rightarrow 0} \frac {P(t<T \le t + \Delta t | T >t)} {\Delta t}$$

The cumulative hazard is the integral of the (instantaneous) hazard rate over age or time. It is like summing probabilities, but since $\Delta t$ is very small, these probabilities are also small numbers (e.g. the hazard rate of dying may be around 0.004 at ages around 30). The hazard rate is conditional on not having experienced the event before $t$, so for a population the cumulative hazard may exceed 1.
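Both points can be seen in a small sketch with a constant hazard. The rate of 0.5 per year is a made-up value, not taken from any life table, and the function names are for illustration only.

```python
import math

RATE = 0.5  # hypothetical constant hazard per year


def survival(t):
    """Survival function for a constant hazard RATE."""
    return math.exp(-RATE * t)


def conditional_event_prob(t, dt):
    """P(t < T <= t + dt | T > t) under the model above."""
    return (survival(t) - survival(t + dt)) / survival(t)


# The conditional probability divided by dt approaches the hazard rate
# as the interval shrinks.
t = 3.0
for dt in (1.0, 0.1, 0.001):
    print(dt, conditional_event_prob(t, dt) / dt)

# Cumulative hazard after 4 years is RATE * 4, which is larger than 1,
# while the corresponding survival probability stays between 0 and 1.
cumulative_hazard = RATE * 4.0
print(cumulative_hazard, math.exp(-cumulative_hazard))
```

The cumulative hazard is not a probability, so nothing prevents it from exceeding 1.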

You can look up a human mortality life table (although that is a discrete-time formulation) and try accumulating $m_x$.

If you use R, here's a little example of approximating these functions from the number of deaths in each 1-year age interval:

dx <-  c(3184L, 268L, 145L, 81L, 64L, 81L, 101L, 50L, 72L, 76L, 50L, 
         62L, 65L, 95L, 86L, 120L, 86L, 110L, 144L, 147L, 206L, 244L, 
         175L, 227L, 182L, 227L, 205L, 196L, 202L, 154L, 218L, 279L, 193L, 
         223L, 227L, 300L, 226L, 256L, 259L, 282L, 303L, 373L, 412L, 297L, 
         436L, 402L, 356L, 485L, 495L, 597L, 645L, 535L, 646L, 851L, 689L, 
         823L, 927L, 878L, 1036L, 1070L, 971L, 1225L, 1298L, 1539L, 1544L, 
         1673L, 1700L, 1909L, 2253L, 2388L, 2578L, 2353L, 2824L, 2909L, 
         2994L, 2970L, 2929L, 3401L, 3267L, 3411L, 3532L, 3090L, 3163L, 
         3060L, 2870L, 2650L, 2405L, 2143L, 1872L, 1601L, 1340L, 1095L, 
         872L, 677L, 512L, 376L, 268L, 186L, 125L, 81L, 51L, 31L, 18L, 
         11L, 6L, 3L, 2L)

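x  <- 0:(length(dx) - 1)      # age vector
lx <- rev(cumsum(rev(dx)))    # number still alive at the start of each age

# Approximate hazard: deaths in the interval divided by survivors at its end
# (algebraically the same as (dx/sum(dx)) / (1 - cumsum(dx/sum(dx)))).
# Nobody survives the last interval, so that value is infinite and is dropped.
hx <- dx / (lx - dx)
hx[!is.finite(hx)] <- NA

plot(x, hx, type = "l", xlab = "age", ylab = "h(t)", main = "h(t)", log = "y")
plot(x, cumsum(hx), type = "l", xlab = "age", ylab = "H(t)", main = "H(t)")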

Hope this helps.

Survival Analysis – Comprehensive Guide to Hazard Function

Before obtaining the hazard function of $T=\min\{T_1,...,T_n\}$, let's first derive its distribution and its density function, i.e. the CDF and PDF of the first order statistic from a sample of independent but not identically distributed random variables.

The distribution of the minimum of $n$ independent random variables is

$$F_T(t) = 1-\prod_{i=1}^n[1-F_i(t)]$$

(see the reasoning in this CV post, if you don't know it already)

We differentiate to obtain its density function:

$$f_T(t) =\frac {\partial}{\partial t}F_T(t) = f_1(t)\prod_{i\neq 1}[1-F_i(t)]+...+f_n(t)\prod_{i\neq n}[1-F_i(t)]$$

Using $h_i(t) = \frac {f_i(t)}{1-F_i(t)} \Rightarrow f_i(t) = h_i(t)(1-F_i(t))$ and substituting in $f_T(t)$ we have

$$f_T(t) = h_1(t)(1-F_1(t))\prod_{i\neq 1}[1-F_i(t)]+...+h_n(t)(1-F_n(t))\prod_{i\neq n}[1-F_i(t)]$$

$$=\left(\prod_{i=1}^n[1-F_i(t)]\right)\sum_{i=1}^nh_i(t),\;\;\; h_i(t) = \frac {f_i(t)}{1-F_i(t)} \tag{1}$$

which is the density function of the minimum of $n$ independent but not identically distributed random variables.

Then the hazard rate of $T$ is

$$h_T(t) = \frac {f_T(t)}{1-F_T(t)} = \frac {\left(\prod_{i=1}^n[1-F_i(t)]\right)\sum_{i=1}^nh_i(t)}{\prod_{i=1}^n[1-F_i(t)]} = \sum_{i=1}^nh_i(t) \tag{2}$$
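Result (2) can be checked numerically for a concrete pair of distributions. The sketch below assumes two independent components chosen for illustration (a standard exponential and a Weibull with shape 2 and scale $\sqrt{\log 2}$) and uses the identity $h_T(t) = -\frac{d}{dt}\log(1-F_T(t))$, which is equivalent to $f_T(t)/(1-F_T(t))$. The function names are hypothetical.

```python
import math

LOG2 = math.log(2)


def surv1(t):
    """1 - F_1(t): standard exponential."""
    return math.exp(-t)


def surv2(t):
    """1 - F_2(t): Weibull with shape 2 and scale sqrt(log 2)."""
    return math.exp(-t ** 2 / LOG2)


def hazard1(t):
    """Hazard of the standard exponential (constant)."""
    return 1.0


def hazard2(t):
    """Hazard of the Weibull above: derivative of t**2 / log(2)."""
    return 2.0 * t / LOG2


def surv_min(t):
    """1 - F_T(t) for T = min(T1, T2) with independent T1, T2."""
    return surv1(t) * surv2(t)


def hazard_min_numeric(t, eps=1e-6):
    """h_T(t) = -d/dt log(1 - F_T(t)), via a central difference."""
    upper = math.log(surv_min(t + eps))
    lower = math.log(surv_min(t - eps))
    return -(upper - lower) / (2 * eps)


for t in (0.3, 1.0, 2.0):
    print(t, hazard_min_numeric(t), hazard1(t) + hazard2(t))
```

At each time point the numerically differentiated hazard of the minimum agrees with the sum of the component hazards, up to finite-difference error.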
