Solved – Why is coxph() so fast for survival analysis on big data

I frequently do survival analysis on large data sets. One million samples or more is typical, and this seems to be much more than typical research usage. Many algorithms I've used are prohibitively slow at this scale.

Fortunately, the coxph routine in R's survival package is remarkably fast and scales much better with sample size than other packages. Why is this? Is there a reference for the algorithm implemented in coxph?

I'd like to understand so that I can investigate numerically efficient methods for time-dependent covariates on large data. Existing methods seem to be very slow (including the time transform functionality in survival) and I suspect that they may consume much more time and/or memory than necessary.

At the very least, it seems that some delicate algorithmic techniques have been used. Here's a relevant note from the time-dependent covariates vignette:

Although handy, the computational impact of the tt argument should
be considered before using it. The Cox model requires computation of a
weighted mean and variance of the covariates at each event time, a
process that is inherently $O(ndp^2)$ where $n =$ the sample size, $d =$ the number of events and $p =$ the number of covariates. Much of the algorithmic effort in coxph() is to use updating methods for
the mean and variance matrices, reducing the compute time to
$O((n+d)p^2)$. When a tt term appears updating is not possible; for
even moderate size data sets the impact of $nd$ versus $n + d$ can be
surprising.

This is a tantalizing hint but I don't know how the overall algorithm works and thus cannot really understand it. I'd prefer a mathematical exposition over wading through the source, and I wasn't able to find one in the references for coxph in the survival package manual on CRAN.

Solved – Why is coxph() so fast for survival analysis on big data

Best Answer

Related Question

Best Answer

Related Solutions

Solved – Two significant covariates in coxph analysis: exploring whether the two covariates interact significantly

Repeated Measures in R’s coxph() – Detailed Handling

Related Question