Solved – How can a smaller learning rate hurt the performance of a gbm?

boosting, machine-learning, out-of-sample

I've always subscribed to the folk wisdom that decreasing the learning rate in a gbm (gradient boosted tree model) does not hurt the out-of-sample performance of the model. Today, I'm not so sure.

I'm fitting models (minimizing the sum of squared errors) to the Boston housing dataset. Here is a plot of error by number of trees on a 20 percent hold-out test set:

[Plot: error by number of trees for different learning rates]
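For concreteness, here is a minimal sketch of how such curves can be produced. It is not my exact code: it assumes scikit-learn's GradientBoostingRegressor, and since the Boston loader has been removed from recent scikit-learn releases, the California housing data stands in for the Boston data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Stand-in data set (the Boston loader is gone from recent scikit-learn).
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for lr in (0.1, 0.01, 0.001):
    model = GradientBoostingRegressor(
        n_estimators=2000, learning_rate=lr,
        loss="squared_error", random_state=0)
    model.fit(X_train, y_train)
    # staged_predict yields predictions after each additional tree,
    # so a single fit gives the whole error-vs-trees curve.
    test_mse = [np.mean((y_test - pred) ** 2)
                for pred in model.staged_predict(X_test)]
    plt.plot(range(1, len(test_mse) + 1), test_mse,
             label=f"learning rate = {lr}")

plt.xlabel("Number of trees")
plt.ylabel("Hold-out MSE")
plt.legend()
plt.show()
```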

It's hard to see what's going on at the end, so here's a zoomed-in version at the extremes:

[Plot: zoomed-in version]

It seems that in this example, the learning rate of $0.01$ is best, with the smaller learning rates performing worse on the hold-out data.

How is this best explained?

Is this an artifact of the small size of the Boston data set? I'm much more familiar with situations where I have hundreds of thousands or millions of data points.

Should I start tuning the learning rate with a grid search (or some other meta-algorithm)?

Best Answer

Yes, you're right that a lower learning rate should find a better optimum than a higher learning rate. But you should tune the hyper-parameters using a grid search to find the best combination of the learning rate and the other hyper-parameters.

The GBM algorithm has several hyper-parameters in addition to the learning rate (shrinkage):

  1. Number of trees
  2. Interaction depth
  3. Minimum observation in a node
  4. Bag fraction (fraction of randomly selected observations)

The grid search needs to cover all of these in order to determine the optimal set of parameters.
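As a hedged illustration, here is a minimal sketch of such a grid search using scikit-learn's GradientBoostingRegressor, with the four parameters above mapped to n_estimators, max_depth, min_samples_leaf and subsample alongside learning_rate; the grid values and the California housing stand-in data are placeholders, not recommendations.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

param_grid = {
    "n_estimators": [500, 1000, 3000],    # number of trees
    "max_depth": [2, 4, 6],               # interaction depth
    "min_samples_leaf": [5, 10, 20],      # minimum observations in a node
    "subsample": [0.5, 0.8, 1.0],         # bag fraction
    "learning_rate": [0.1, 0.01, 0.001],  # shrinkage
}

search = GridSearchCV(
    GradientBoostingRegressor(loss="squared_error", random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
print("hold-out MSE:", ((y_test - search.predict(X_test)) ** 2).mean())
```

In practice you would usually coarsen the grid or start with a randomized search, since a full grid over five parameters grows very quickly.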

For example, on some data-sets I've tuned with GBM, I've observed that accuracy varies widely as each hyper-parameter is changed. I haven't run GBM on your data-set, but here is a graph from a similar tuning exercise on another data-set: a classification problem with highly imbalanced classes.

[Plot: impact of varying shrinkage (learning rate) on the Kappa metric]

The accuracy is generally highest at the lower learning rates: for a max tree depth of 16, for example, the Kappa metric is 0.425 at a learning rate of 0.2, which is better than 0.415 at a learning rate of 0.35.

But when you look at learning rates of 0.25 vs. 0.26, there is a small but sharp increase in Kappa for max tree depths of 14, 15 and 16, whereas Kappa continues to decrease for tree depths of 12 and 13.

Hence, I would suggest you try a grid search.
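If you want to score the search on Kappa, as in the tuning exercise above, here is a minimal sketch of one way to do that in Python; the synthetic imbalanced data set and the grid values are placeholders, not the data or settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic, highly imbalanced two-class data as a placeholder.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

kappa_scorer = make_scorer(cohen_kappa_score)
grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=300, random_state=0),
    {"learning_rate": [0.2, 0.25, 0.26, 0.35],
     "max_depth": [12, 13, 14, 15, 16]},
    scoring=kappa_scorer,
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```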

Additionally, as you mentioned, this situation could also be aggravated by the small size of the data-set.