Solved – How can a smaller learning rate hurt the performance of a gbm?

boosting, machine-learning, out-of-sample

I've always subscribed to the folk wisdom that decreasing the learning rate in a gbm (gradient boosted tree model) does not hurt the out-of-sample performance of the model. Today, I'm not so sure.

I'm fitting models (minimizing the sum of squared errors) to the Boston housing dataset. Here is a plot of error by number of trees on a 20 percent hold-out test set:

[Plot: error by number of trees for different learning rates]
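For concreteness, here is a minimal sketch of how such curves can be produced. It is not my exact code: it assumes scikit-learn's GradientBoostingRegressor, and since the Boston loader has been removed from recent scikit-learn releases, the California housing data stands in for the Boston data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Stand-in data set (the Boston loader is gone from recent scikit-learn).
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for lr in (0.1, 0.01, 0.001):
    model = GradientBoostingRegressor(
        n_estimators=2000, learning_rate=lr,
        loss="squared_error", random_state=0)
    model.fit(X_train, y_train)
    # staged_predict yields predictions after each additional tree,
    # so a single fit gives the whole error-vs-trees curve.
    test_mse = [np.mean((y_test - pred) ** 2)
                for pred in model.staged_predict(X_test)]
    plt.plot(range(1, len(test_mse) + 1), test_mse,
             label=f"learning rate = {lr}")

plt.xlabel("Number of trees")
plt.ylabel("Hold-out MSE")
plt.legend()
plt.show()
```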

It's hard to see what's going on at the end, so here's a zoomed-in version at the extremes:

[Plot: zoomed-in version]

It seems that in this example, the learning rate of $0.01$ is best, with the smaller learning rates performing worse on the hold-out data.

How is this best explained?

Is this an artifact of the small size of the Boston data set? I'm much more familiar with situations where I have hundreds of thousands or millions of data points.

Should I start tuning the learning rate with a grid search (or some other meta-algorithm)?

Best Answer

Yes, you're right that a lower learning rate should find a better optimum than a higher learning rate. But you should tune the hyper-parameters using a grid search to find the best combination of the learning rate and the other hyper-parameters.

The GBM algorithm has several hyper-parameters in addition to the learning rate (shrinkage):

  1. Number of trees
  2. Interaction depth
  3. Minimum observation in a node
  4. Bag fraction (fraction of randomly selected observations)

The grid search needs to cover all of these in order to determine the optimal set of parameters.
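As a hedged illustration, here is a minimal sketch of such a grid search using scikit-learn's GradientBoostingRegressor, with the four parameters above mapped to n_estimators, max_depth, min_samples_leaf and subsample alongside learning_rate; the grid values and the California housing stand-in data are placeholders, not recommendations.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

param_grid = {
    "n_estimators": [500, 1000, 3000],    # number of trees
    "max_depth": [2, 4, 6],               # interaction depth
    "min_samples_leaf": [5, 10, 20],      # minimum observations in a node
    "subsample": [0.5, 0.8, 1.0],         # bag fraction
    "learning_rate": [0.1, 0.01, 0.001],  # shrinkage
}

search = GridSearchCV(
    GradientBoostingRegressor(loss="squared_error", random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
print("hold-out MSE:", ((y_test - search.predict(X_test)) ** 2).mean())
```

In practice you would usually coarsen the grid or start with a randomized search, since a full grid over five parameters grows very quickly.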

For example, on some data-sets I've tuned with GBM, I've observed that accuracy varies widely as each hyper-parameter is changed. I haven't run GBM on your data-set, but here is a graph from a similar tuning exercise on another data-set: a classification problem with highly imbalanced classes.

[Plot: impact of varying shrinkage (learning rate) on the Kappa metric]

The accuracy is generally highest at the lower learning rates: for a max tree depth of 16, for example, the Kappa metric is 0.425 at a learning rate of 0.2, which is better than 0.415 at a learning rate of 0.35.

But when you look at learning rates of 0.25 vs. 0.26, there is a small but sharp increase in Kappa for max tree depths of 14, 15 and 16, whereas Kappa continues to decrease for tree depths of 12 and 13.

Hence, I would suggest you try a grid search.
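If you want to score the search on Kappa, as in the tuning exercise above, here is a minimal sketch of one way to do that in Python; the synthetic imbalanced data set and the grid values are placeholders, not the data or settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic, highly imbalanced two-class data as a placeholder.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

kappa_scorer = make_scorer(cohen_kappa_score)
grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=300, random_state=0),
    {"learning_rate": [0.2, 0.25, 0.26, 0.35],
     "max_depth": [12, 13, 14, 15, 16]},
    scoring=kappa_scorer,
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```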

Additionally, as you mentioned, this situation could also be aggravated by the small size of the data-set.