Solved – Faster xgb.cv for large data set


I have a data set containing 1.1 million observations and 14 variables. The response is 0 or 1. It was suggested to me that I use gradient boosted trees to build my logistic model.

Using xgb.cv from xgboost in R, I'm attempting to estimate the best hyperparameters on a holdout of 2/3 of the data. However, the code takes forever to run. It took me 13 hours for learning rate = 0.5, depth = 7, number of folds = 5 and number of trees = 10000. I can't imagine how long it will take to loop over different learning rates and depths.

How could I make the process faster? I guess that reducing the number of trees to 2500 would make sense, based on my error curve. Will reducing the number of folds help? Is it really necessary to do bootstrapping?

My current code looks like this, for reference:

library(xgboost)

etas = c(0.75, 0.5, 0.1)
max.depths = c(11, 9, 7, 5, 3)
fitAssessmentLst = list()
lstPos = 0
for(eta in etas){
  for(max.depth in max.depths){
    lstPos = lstPos + 1
    # 5-fold CV for this (eta, max_depth) combination
    x = xgb.cv(params = list(objective = "binary:logistic", eta = eta,
        max_depth = max.depth, nthread = 3),
        data = train_data.xgbdm,     # xgb.DMatrix built from the 2/3 training split
        nrounds = 10000,
        prediction = FALSE,
        showsd = TRUE,
        nfold = 5,                   # note: the argument is nfold, not nfolds
        verbose = 0,
        print_every_n = 1,
        early_stopping_rounds = NULL
        )
    fitAssessmentLst[[lstPos]] = list(eta = eta, max.depth = max.depth, assessmentTbl = x)
  }
}

Best Answer

It took me 13 hours for learning rate = 0.5, depth = 7, number of folds = 5 and number of trees = 10000.

The quickest & easiest way to make this faster is to use xgboost's early stopping feature, as suggested in the comments. The way it works is that you supply a hold-out set to the model and specify a positive integer. That integer is the number of consecutive rounds the hold-out loss is allowed to go without improving before boosting stops. For example, if you specify 3, boosting continues if the hold-out loss fails to improve for 2 rounds but then improves on the third round; if it fails to improve for three consecutive rounds, boosting terminates. (Boosting also terminates if you exhaust the number of boosting rounds - in your case, 10000.)
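To make that concrete, here is a minimal sketch of the question's xgb.cv call with early stopping switched on. It runs on a small synthetic stand-in for the real 1.1M-row DMatrix, and the choice of 30 stopping rounds and the explicit logloss metric are illustrative, not recommendations:

library(xgboost)

# Small synthetic stand-in for the question's data (sketch only)
set.seed(1)
X <- matrix(rnorm(5000 * 14), ncol = 14)
y <- rbinom(5000, 1, plogis(X[, 1] - X[, 2]))
dtrain <- xgb.DMatrix(X, label = y)

# Same style of CV call as in the question, but with early stopping enabled
cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eta = 0.1, max_depth = 7, nthread = 3),
             data = dtrain,
             nrounds = 10000,             # upper bound; early stopping cuts this short
             nfold = 5,
             metrics = "logloss",
             early_stopping_rounds = 30,  # stop if CV logloss hasn't improved for 30 rounds
             verbose = 0)

cv$best_iteration         # number of boosting rounds actually used
head(cv$evaluation_log)   # per-round mean/sd of train and test logloss across folds

The best_iteration element of the result also gives you an empirical answer to the "should I cut nrounds to 2500?" question, rather than having to eyeball the error curve.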

The reason this is useful is that it gives you a principled way to terminate training - it stops when hold-out performance stops improving - without having to specify the number of boosting rounds ahead of time. The procedure is still a little arbitrary, though, in that choosing different integers can dramatically change the number of boosting rounds.

The reason you specify an integer as part of the procedure is that the loss on the hold-out data will often increase transiently before decreasing again. This pattern persists until the loss increases consistently, where "consistently" is defined, in a rather ad hoc way, by the integer you supplied.

It's not a perfect system, but it works well enough in practice.

As an aside, with a learning rate as high as 0.5, it's often the case that early stopping terminates training within a dozen or so rounds of boosting. In my experience the sweet spot for the learning rate is somewhere between 0.05 and 0.2. Decreasing the learning rate typically increases the number of boosting rounds (when you're using early stopping) and yields an improved out-of-sample loss. So there's a trade-off between how long you're willing to wait for results, how many trees you want (more trees make the model larger, which is not always a trivial concern), and how good the model is.
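A small sweep over learning rates in that range shows the trade-off directly: as eta drops, early stopping runs for more rounds and the CV logloss usually improves. This is again only a sketch, reusing the synthetic stand-in data and the illustrative 30 stopping rounds from the earlier snippet:

library(xgboost)

set.seed(1)
X <- matrix(rnorm(5000 * 14), ncol = 14)
y <- rbinom(5000, 1, plogis(X[, 1] - X[, 2]))
dtrain <- xgb.DMatrix(X, label = y)

for (eta in c(0.2, 0.1, 0.05)) {
  cv <- xgb.cv(params = list(objective = "binary:logistic",
                             eta = eta, max_depth = 7, nthread = 3),
               data = dtrain,
               nrounds = 10000,
               nfold = 5,
               metrics = "logloss",
               early_stopping_rounds = 30,   # illustrative, as above
               verbose = 0)
  # Report how many rounds each learning rate needed and the CV loss it reached
  cat(sprintf("eta = %.2f: stopped after %d rounds, CV logloss = %.5f\n",
              eta, cv$best_iteration,
              cv$evaluation_log$test_logloss_mean[cv$best_iteration]))
}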


To answer the titular question, I've heard from colleagues that LightGBM is a faster gradient boosted tree package. I don't have any definitive evidence to substantiate that claim.
