Solved – Large-scale Cox regression with R (Big Data)

cox-model, logistic, r, sas, survival

I am trying to run a Cox regression on a sample dataset of 2,000,000 rows, using only R, as follows. This is a direct translation of a PHREG procedure in SAS. The sample is representative of the structure of the original dataset.

##
library(survival)

### Replace 100000 by 2,000,000 for the full-size problem
test <- data.frame(start      = runif(100000, 1, 100),
                   stop       = runif(100000, 101, 300),
                   censor     = round(runif(100000, 0, 1)),
                   testfactor = round(runif(100000, 1, 11)))

test$testfactorf <- as.factor(test$testfactor)

### Time the fit; level 2 of testfactorf is used as the reference level
system.time(summ <- coxph(Surv(start, stop, censor) ~ relevel(testfactorf, 2), test))

# summary(summ)
##

user  system elapsed 
9.400   0.090   9.481 
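To see how this scales with the number of rows, the same fit can be timed at a few increasing sample sizes (a minimal benchmarking sketch; the sizes are arbitrary):

##
library(survival)

### Rough scaling check: extend the sequence as far as patience allows
for (n in c(1e5, 2e5, 4e5)) {
  d <- data.frame(start      = runif(n, 1, 100),
                  stop       = runif(n, 101, 300),
                  censor     = round(runif(n, 0, 1)),
                  testfactor = round(runif(n, 1, 11)))
  d$testfactorf <- as.factor(d$testfactor)
  cat("n =", n, "\n")
  print(system.time(coxph(Surv(start, stop, censor) ~ relevel(testfactorf, 2), data = d)))
}
##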

The main challenge is the compute time on the original dataset (2 million rows). As far as I understand, in SAS this could take up to a day, … but at least it finishes.

  • Running the example with only 100,000 observations takes only 9 seconds. Thereafter, the runtime grows roughly quadratically with each additional 100,000 observations (the timing sketch above illustrates how to measure this).

  • I have not found any way to parallelize the operation (we could leverage a 48-core machine if this were possible); see the sketch after this list.

  • Neither biglm nor any of the packages from Revolution Analytics supports Cox regression, so I cannot leverage those.
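On the parallelization point, the closest workaround I can think of is a divide-and-conquer approximation: fit coxph on random chunks in parallel, then pool each coefficient by inverse-variance weighting. A rough sketch, assuming a Unix-alike where mclapply can fork (on Windows, parLapply would be needed); the pooled result only approximates the full-data fit:

##
library(survival)
library(parallel)

### Split row indices into random chunks and fit each chunk in parallel
n_chunks <- 8
chunk_id <- sample(rep(seq_len(n_chunks), length.out = nrow(test)))
fits <- mclapply(split(seq_len(nrow(test)), chunk_id),
                 function(idx) coxph(Surv(start, stop, censor) ~ relevel(testfactorf, 2),
                                     data = test[idx, ]),
                 mc.cores = n_chunks)

### Inverse-variance pooling, illustrated on the first coefficient
betas <- sapply(fits, function(f) coef(f)[1])
ses   <- sapply(fits, function(f) sqrt(vcov(f)[1, 1]))
w     <- 1 / ses^2
c(beta = sum(w * betas) / sum(w), se = sqrt(1 / sum(w)))
##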

Is there a way to represent this in terms of a logistic regression (for which packages from Revolution do exist), or is there any other alternative to this problem? I know the two models are fundamentally different, but a logistic reformulation is the closest possibility I can see under the circumstances.
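To make concrete the kind of reformulation I have in mind, the usual route is a discrete-time approximation: split the follow-up time at a grid of cut points with survSplit, then fit a binomial GLM with one dummy per interval. A cloglog link approximates the proportional hazards model; a logit link gives the discrete-time logistic model. A rough sketch, assuming a recent survival release whose survSplit accepts start-stop data; the cut points here are arbitrary:

##
library(survival)

### Split each record at a coarse grid of event-time quantiles
cuts <- quantile(test$stop[test$censor == 1], probs = seq(0.1, 0.9, by = 0.1))
long <- survSplit(Surv(start, stop, censor) ~ testfactorf,
                  data = test, cut = cuts, episode = "interval")

### Grouped-time proportional hazards fit: one baseline dummy per interval
fit <- glm(censor ~ factor(interval) + relevel(testfactorf, 2),
           family = binomial(link = "cloglog"), data = long)
##

Since this is now an ordinary GLM on a (much longer) data frame, a chunked fitter such as biglm::bigglm should in principle be applicable to it.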

Best Answer

I run Cox regressions on a 7,000,000-observation dataset using R, and this is not a problem. Indeed, on bivariate models I get the estimates in 52 seconds. I suggest that it is, as is often the case with R, a problem related to the available RAM. You may need at least 12 GB to run the model smoothly.
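To check whether RAM is indeed the constraint, it is worth looking at the data's in-memory footprint and R's allocations before fitting (a minimal check):

##
print(object.size(test), units = "Mb")  # size of the data frame itself
gc()                                    # memory R has allocated so far
##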