Survival Analysis – Why Random Survival Forest Uses Cumulative Hazard Function to Calculate C Index

concordance, hazard, random forest, survival

I am working with Random Survival Forests, and I found in the documentation that it uses the sum of the cumulative hazard function to define a worse predicted outcome, which is then used to compute the C-index.

But the C-index can easily be computed from predicted survival probabilities, which Random Survival Forest also computes.

My question is why they use the cumulative hazard function instead of the survival function to compute the C-index.
The cumulative hazard function is hard to interpret, and I am still unsure how the sum of the ensemble cumulative hazard function over all time points defines a worse predicted outcome.

Any help would be really appreciated!

Best Answer

The C-index is the fraction of pairs of comparable cases in which the observed event times are in the same order as the predicted outcomes. If survival curves can cross over time, then the order based on "predicted survival probabilities" can change depending on the time point chosen to calculate the probability. Here's an example of survival curves for 2 groups, one with a standard exponential distribution and one with a Weibull distribution having the same median survival (shape, 2; scale, $\sqrt {\log 2}$ in the standard parameterization). Both curves pass through 0.5 at the common median, Time = $\log 2 \approx 0.69$, which is where they cross.

[Figure: crossing survival curves for the two groups]

If you chose predicted survival probability at Time = 0.4 for comparisons, the order of "predicted" survival between these curves would be opposite from what you would get at Time = 1. Yet with many observations in each group you would find many observed survival times in both groups spanning a range out through Time = 1 or farther. Some type of "predicted outcome" based on a range of survival times, rather than a single time, might be more generalizable.
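The flip in ordering can be checked numerically. This is a minimal sketch that evaluates the two theoretical survival functions from the example directly; the function names are illustrative and nothing here comes from a survival forest package.

```python
import math

# Weibull scale from the example above, chosen so both medians equal log(2)
SCALE = math.sqrt(math.log(2))

def surv_exponential(t):
    """S(t) for a standard (rate 1) exponential distribution."""
    return math.exp(-t)

def surv_weibull(t, shape=2.0, scale=SCALE):
    """S(t) for a Weibull distribution in the standard parameterization."""
    return math.exp(-((t / scale) ** shape))

# Compare the two curves before, at, and after the common median
for t in (0.4, math.log(2), 1.0):
    print(f"t={t:.3f}  exponential={surv_exponential(t):.3f}  "
          f"weibull={surv_weibull(t):.3f}")
```

At Time = 0.4 the Weibull group has the higher survival probability; at Time = 1 the exponential group does. A ranking based on survival probability at a single time therefore depends on which time is picked.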

The cumulative hazard function $H(t)$ has a simple, direct relationship to survival $S(t)$ over (continuous) time: $S(t) = \exp(-H(t))$.* Although many struggle to develop an intuitive sense of the cumulative hazard, it's a very useful summary of the risk that's been accumulated up through time $t$. The out-of-bag (OOB) estimates of cumulative hazard for a case thus provide a reasonable "predicted outcome" value up through that time. The cumulative hazards over time for the two groups above are:

[Figure: corresponding cumulative hazards over time]

Choosing a specific time to compare cumulative hazards leads to the same problem as choosing a specific time to compare survival probabilities. I suspect that those who developed random survival forests considered something like integrating cumulative hazard over a time span of interest to be a better estimate of the "predicted outcome." That's still arbitrary, but perhaps a better choice when there are many groups with different predicted survival curves to compare overall. In this example, if the time span of interest extended out to a bit over Time = 1, the "predicted outcomes" for these 2 groups would then be the same. That makes some sense for groups that have the same median survival and are observed to some reasonable time beyond (but not too far beyond) that median.
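The "bit over Time = 1" figure can be worked out by hand. This is a sketch using the true cumulative hazards of the two example distributions (not forest estimates), and it assumes that "integrating over a time span" means integrating from 0 to a horizon $T$:

$$H_{\text{exp}}(t) = t, \qquad H_{\text{Weib}}(t) = \left(\frac{t}{\sqrt{\log 2}}\right)^2 = \frac{t^2}{\log 2}$$

$$\int_0^T H_{\text{exp}}(t)\,dt = \frac{T^2}{2}, \qquad \int_0^T H_{\text{Weib}}(t)\,dt = \frac{T^3}{3\log 2}$$

$$\frac{T^2}{2} = \frac{T^3}{3\log 2} \;\Longrightarrow\; T = \tfrac{3}{2}\log 2 \approx 1.04$$

For horizons shorter than that, the exponential group has the larger integrated cumulative hazard (the worse "predicted outcome"); for longer horizons, the Weibull group does.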

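In response to a comment, there are two different sums involved here. At any particular time, you combine the OOB estimates of cumulative hazard for a case at that time, summing over the trees in which that case was out of bag and normalizing by the number of such trees, to get the ensemble estimate for that case at that time.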

For the summary over time to get the "predicted outcome" for each case, the software you cite then adds up those estimates over all event times in the data set. Given the bootstrapping while growing the random forest, that would seem to provide a good estimate of the situation in the underlying population. The original paper, however, notes (page 847) that other choices for the summation over time are possible.
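To make the second sum concrete, here is a minimal sketch of scoring each case by its cumulative hazard summed over the observed event times, then computing a Harrell-type C-index from those scores. It is not the package's implementation: the function names `mortality_score` and `c_index`, the toy data, and the linear cumulative hazards $H(t) = \text{rate} \times t$ are all hypothetical, and pairs with tied times are simply skipped.

```python
from itertools import combinations

def mortality_score(chf, event_times):
    """Sum one case's cumulative hazard over the given event times."""
    return sum(chf(t) for t in event_times)

def c_index(times, events, scores):
    """Harrell-type C: a higher score should go with a shorter observed time."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that case `a` has the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue  # tied times or censored shorter time: not comparable here
        comparable += 1
        if scores[a] > scores[b]:
            concordant += 1.0
        elif scores[a] == scores[b]:
            concordant += 0.5  # tied scores count as half
    return concordant / comparable if comparable else float("nan")

# Hypothetical toy data: three cases, the last one censored
times = [0.3, 0.8, 1.5]
events = [1, 1, 0]
rates = [2.0, 1.0, 0.5]  # assumed per-case hazard rates, so H(t) = rate * t
chfs = [lambda t, r=r: r * t for r in rates]

# Sum each case's cumulative hazard over the unique observed event times
event_times = sorted({t for t, e in zip(times, events) if e})
scores = [mortality_score(h, event_times) for h in chfs]
print(scores, c_index(times, events, scores))
```

Because the score aggregates risk over the whole set of event times, it gives each case a single number to rank by, without having to pick one time point at which to compare curves.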


*As the documentation you linked discusses, this doesn't hold for the ensemble estimates based on averaging over all trees, even if it does within each tree. But the general principle is important: the cumulative hazard is a type of representation of (lack of) survival over time.