Boosting – Why Learning Rate Is Considered a Regularization Parameter

boosting, data mining, machine learning, overfitting, predictive-models

The learning rate parameter ($\nu \in [0,1]$) in Gradient Boosting shrinks the contribution of each new base model (typically a shallow tree) that is added to the series. It has been shown to dramatically improve test set accuracy, which is understandable: with smaller steps, the minimum of the loss function can be approached more precisely.
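Concretely, in the notation of ESL (section 10.12.1), shrinkage scales each update of the tree-boosting algorithm:

$$f_m(x) = f_{m-1}(x) + \nu \cdot \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm}),$$

where $R_{jm}$ are the terminal regions of the $m$-th tree and $\gamma_{jm}$ its fitted leaf values.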

I don't understand why the learning rate is considered a regularization parameter. Citing The Elements of Statistical Learning, section 10.12.1, p. 364:

Controlling the number of trees is not the only possible
regularization strategy. As with ridge regression and neural networks,
shrinkage techniques can be employed as well. Smaller values of $\nu$
(more shrinkage) result in larger training risk for the same number of
iterations $M$. Thus, both $\nu$ and $M$ control prediction risk on
the training data.

Regularization means "a way to avoid overfitting", so it is clear that the number of iterations $M$ is crucial in that respect (an $M$ that is too high leads to overfitting). But:

Smaller values of $\nu$ (more shrinkage) result in larger training
risk for the same number of iterations $M$.

just means that with low learning rates, more iterations are needed to achieve the same accuracy on the training set. So how does that relate to overfitting?

Best Answer

Suppose you are trying to minimize an objective function iteratively, and its current value is $100.0$. In the given data set there are no "irreducible errors", so you can minimize the loss to $0.0$ on your training data. Now you have two ways to do it.

  • The first way is a "large learning rate" and few iterations. Suppose you can reduce the loss by $10.0$ in each iteration; then in $10$ iterations you can reduce the loss to $0.0$.

  • The second way is a "small learning rate" but more iterations. Suppose you can reduce the loss by $1.0$ in each iteration; then you need $100$ iterations to reach $0.0$ loss on your training data. (Both schedules are sketched in code right after this list.)
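A minimal Python sketch of this toy arithmetic (the numbers are purely illustrative, as above):

```python
# Toy illustration: same starting loss, two step sizes ("learning rates").
start_loss = 100.0
for step in (10.0, 1.0):              # "large" vs "small" learning rate
    loss, iterations = start_loss, 0
    while loss > 0.0:
        loss -= step                  # each iteration removes `step` worth of loss
        iterations += 1
    print(f"step={step}: loss 0.0 reached after {iterations} iterations")
# step=10.0 -> 10 iterations, step=1.0 -> 100 iterations
```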

Now think about this: are the two approaches equal? And if not, which is better in an optimization context and in a machine learning context?

In the optimization literature, the two approaches are the same, since they both converge to the optimal solution. In machine learning, however, they are not equal, because in most cases we do not want to drive the training loss to $0$, which would cause over-fitting.

We can think of the first approach as a "coarse-level grid search" and the second approach as a "fine-level grid search". The second approach usually works better, but needs more computational power for the extra iterations.

To prevent over-fitting, we can do different things. The first way would be to restrict the number of iterations: suppose we are using the first approach and limit the number of iterations to $5$. At the end, the loss on the training data is $50$. (By the way, this would be very strange from the optimization point of view, which means we could further improve our solution / it has not converged, but we choose not to. In optimization, we usually add explicit constraints or penalty terms to the objective function, but rarely limit the number of iterations.)

On the other hand, we can also use the second approach: if we set the learning rate to be small, say reducing the loss by $0.1$ in each iteration, then even with a large number of iterations, say $500$, we still have not minimized the loss to $0.0$.
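The same toy numbers show that the two strategies leave a comparable amount of training loss on the table (again purely illustrative):

```python
# Remaining training loss under the two regularization strategies above.
start_loss = 100.0
after_capped_iterations = start_loss - 5 * 10.0     # large steps, stopped early
after_small_learning_rate = start_loss - 500 * 0.1  # small steps, many iterations
print(after_capped_iterations, after_small_learning_rate)  # about 50.0 in both cases
```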

This is why a small learning rate is, in a sense, equal to "more regularization".

Here is an example of using different learning rates on an experimental data set with xgboost. Please check the following two links to see what eta and n_iterations mean.

Parameters for Tree Booster

XGBoost Control overfitting

For the same number of iterations, say $50$, a small learning rate is "under-fitting" (the model has "high bias"), and a large learning rate is "over-fitting" (the model has "high variance").

Figure: example of changing the learning rate (eta) in an XGBoost model.
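A minimal sketch of how such an experiment could be run (assuming the xgboost and scikit-learn packages; the synthetic data set and parameter values are only illustrative, not the ones behind the figure above):

```python
# Compare train/test error for several learning rates (eta) at a fixed
# number of boosting rounds (n_estimators=50).
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for eta in [0.01, 0.1, 0.5, 1.0]:
    model = XGBRegressor(n_estimators=50, learning_rate=eta, max_depth=3,
                         random_state=0)
    model.fit(X_train, y_train)
    rmse_train = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
    rmse_test = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"eta={eta:<4}  train RMSE={rmse_train:8.2f}  test RMSE={rmse_test:8.2f}")
# Typically, a very small eta under-fits (both errors high and close together),
# while a very large eta over-fits (train error much lower than test error).
```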

PS. The evidence of under-fitting is that both the training and testing sets have large error, and the error curves for training and testing are close to each other. The sign of over-fitting is that the training set's error is very low while the testing set's error is very high, and the two curves are far away from each other.