Boosting – Why Learning Rate Is Considered a Regularization Parameter

boosting, data mining, machine learning, overfitting, predictive-models

The learning rate parameter ($\nu \in [0,1]$) in Gradient Boosting shrinks the contribution of each new base model (typically a shallow tree) that is added to the series. It has been shown to dramatically improve test set accuracy, which is understandable: with smaller steps, the minimum of the loss function can be approached more precisely.
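Concretely, in the notation of ESL (section 10.12.1), shrinkage scales each update of the tree-boosting algorithm:

$$f_m(x) = f_{m-1}(x) + \nu \cdot \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm}),$$

where $R_{jm}$ are the terminal regions of the $m$-th tree and $\gamma_{jm}$ its fitted leaf values.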

I don't understand why the learning rate is considered a regularization parameter. Citing The Elements of Statistical Learning, section 10.12.1, p. 364:

Controlling the number of trees is not the only possible
regularization strategy. As with ridge regression and neural networks,
shrinkage techniques can be employed as well. Smaller values of $\nu$
(more shrinkage) result in larger training risk for the same number of
iterations $M$. Thus, both $\nu$ and $M$ control prediction risk on
the training data.

Regularization means "a way to avoid overfitting", so it is clear that the number of iterations $M$ is crucial in that respect (an $M$ that is too high leads to overfitting). But:

Smaller values of $\nu$ (more shrinkage) result in larger training
risk for the same number of iterations $M$.

just means that with low learning rates, more iterations are needed to achieve the same accuracy on the training set. So how does that relate to overfitting?

Best Answer

Suppose you are trying to minimize an objective function iteratively, and its current value is $100.0$. In the given data set there are no "irreducible errors", so you can minimize the loss to $0.0$ on your training data. Now you have two ways to do it.

  • The first way is a "large learning rate" and few iterations. Suppose you can reduce the loss by $10.0$ in each iteration; then in $10$ iterations you can reduce the loss to $0.0$.

  • The second way is a "small learning rate" but more iterations. Suppose you can reduce the loss by $1.0$ in each iteration; then you need $100$ iterations to reach $0.0$ loss on your training data. (Both schedules are sketched in code right after this list.)
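A minimal Python sketch of this toy arithmetic (the numbers are purely illustrative, as above):

```python
# Toy illustration: same starting loss, two step sizes ("learning rates").
start_loss = 100.0
for step in (10.0, 1.0):              # "large" vs "small" learning rate
    loss, iterations = start_loss, 0
    while loss > 0.0:
        loss -= step                  # each iteration removes `step` worth of loss
        iterations += 1
    print(f"step={step}: loss 0.0 reached after {iterations} iterations")
# step=10.0 -> 10 iterations, step=1.0 -> 100 iterations
```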

Now think about this: are the two approaches equal? And if not, which is better in an optimization context and in a machine learning context?

In the optimization literature, the two approaches are the same, since they both converge to the optimal solution. In machine learning, however, they are not equal, because in most cases we do not want to drive the training loss to $0$, which would cause over-fitting.

We can think of the first approach as a "coarse-level grid search" and the second approach as a "fine-level grid search". The second approach usually works better, but needs more computational power for the extra iterations.

To prevent over-fitting, we can do different things. The first way would be to restrict the number of iterations: suppose we are using the first approach and limit the number of iterations to $5$. At the end, the loss on the training data is $50$. (By the way, this would be very strange from the optimization point of view, which means we could further improve our solution / it has not converged, but we choose not to. In optimization, we usually add explicit constraints or penalty terms to the objective function, but rarely limit the number of iterations.)

On the other hand, we can also use the second approach: if we set the learning rate to be small, say reducing the loss by $0.1$ in each iteration, then even with a large number of iterations, say $500$, we still have not minimized the loss to $0.0$.
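The same toy numbers show that the two strategies leave a comparable amount of training loss on the table (again purely illustrative):

```python
# Remaining training loss under the two regularization strategies above.
start_loss = 100.0
after_capped_iterations = start_loss - 5 * 10.0     # large steps, stopped early
after_small_learning_rate = start_loss - 500 * 0.1  # small steps, many iterations
print(after_capped_iterations, after_small_learning_rate)  # about 50.0 in both cases
```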

This is why a small learning rate is, in a sense, equal to "more regularization".

Here is an example of using different learning rates on an experimental data set with xgboost. Please check the following two links to see what eta and n_iterations mean.

Parameters for Tree Booster

XGBoost Control overfitting

For the same number of iterations, say $50$, a small learning rate is "under-fitting" (the model has "high bias"), and a large learning rate is "over-fitting" (the model has "high variance").

Figure: example of changing the learning rate (eta) in an XGBoost model.
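A minimal sketch of how such an experiment could be run (assuming the xgboost and scikit-learn packages; the synthetic data set and parameter values are only illustrative, not the ones behind the figure above):

```python
# Compare train/test error for several learning rates (eta) at a fixed
# number of boosting rounds (n_estimators=50).
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for eta in [0.01, 0.1, 0.5, 1.0]:
    model = XGBRegressor(n_estimators=50, learning_rate=eta, max_depth=3,
                         random_state=0)
    model.fit(X_train, y_train)
    rmse_train = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
    rmse_test = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"eta={eta:<4}  train RMSE={rmse_train:8.2f}  test RMSE={rmse_test:8.2f}")
# Typically, a very small eta under-fits (both errors high and close together),
# while a very large eta over-fits (train error much lower than test error).
```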

PS. The evidence of under-fitting is that both the training and testing sets have large error, and the error curves for training and testing are close to each other. The sign of over-fitting is that the training set's error is very low while the testing set's error is very high, and the two curves are far away from each other.