I frequently do survival analysis on large data sets. One million samples or more is typical, and this seems to be much more than typical research usage. Many algorithms I've used are prohibitively slow at this scale.
Fortunately, the coxph
routine in R's survival
package is remarkably fast and scales much better with sample size than other packages. Why is this? Is there a reference for the algorithm implemented in coxph
?
I'd like to understand so that I can investigate numerically efficient methods for time-dependent covariates on large data. Existing methods seem to be very slow (including the time transform functionality in survival
) and I suspect that they may consume much more time and/or memory than necessary.
At the very least, it seems that some delicate algorithmic techniques have been used. Here's a relevant note from the time-dependent covariates vignette:
Although handy, the computational impact of the
tt
argument should
be considered before using it. The Cox model requires computation of a
weighted mean and variance of the covariates at each event time, a
process that is inherently $O(ndp^2)$ where $n =$ the sample size, $d =$ the number of events and $p =$ the number of covariates. Much of the algorithmic effort incoxph()
is to use updating methods for
the mean and variance matrices, reducing the compute time to
$O((n+d)p^2)$. When att
term appears updating is not possible; for
even moderate size data sets the impact of $nd$ versus $n + d$ can be
surprising.
This is a tantalizing hint but I don't know how the overall algorithm works and thus cannot really understand it. I'd prefer a mathematical exposition over wading through the source, and I wasn't able to find one in the references for coxph
in the survival
package manual on CRAN.
Best Answer
Believe it or not, it's just Newton-Raphson. It's right here. The weighted mean and covariance matrices mentioned in the vignette passage are Equations (3.4) through (3.6).