Solved – Why does XGBoost have a learning rate

Tags: boosting, classification, regularization

Original Question

Having used XGBoost a fair bit, I've clearly seen that changing the learning rate dramatically affects the algorithm's performance. That said, I really can't understand the theoretical justification for it. It makes sense in "vanilla" gradient boosting, where you don't make use of second derivatives. Analogously, one does not need a learning rate when using Newton-Raphson to perform optimisation by finding the zeros of the derivative of the cost function.

I thought it might have to do with ensuring that the updates made at every step are small, so that the second-order expansion of the loss remains valid, but it seems to me that one could achieve the same thing more effectively by regularising?

Also, the XGBoost docs include a theoretical introduction to the algorithm that doesn't mention a learning rate anywhere (https://xgboost.readthedocs.io/en/latest/tutorials/model.html).

Is it as simple as "it is experimentally observed to improve performance", and if so, is it possible to rationalise it post hoc?

Update: Almost a year on, I thought I'd update my thinking on this and somewhat refine my question.

While it might be the case that the need for a learning rate was ascertained experimentally, it seems likely to me that it is necessary because XGBoost assumes that the total loss $L$ of a classifier consisting of an existing classifier $F_{t}(x)$ plus a new classifier $f_{t+1}(x)$ can be written as a Taylor expansion of $L$ about $F_{t}(x)$, which requires $f_{t+1}(x)$ to represent a "small enough" correction to $F_{t}(x)$ that we don't need to expand to too high an order.

My suspicion for a while has been that using lots of regularisation should take care of this, hence the question: why use a learning rate at all? Another approach could be to say that the tree $f_{t+1}(x)$, which splits the space into a number of distinct regions (terminal nodes) $\{R_{j}\}$, outputs a constant $\epsilon \cdot w_{j}$ in the $j^{th}$ region. By choosing a sufficiently small $\epsilon$, we can ensure that $\epsilon \cdot w_{j}$ will be sufficiently small for any partitioning and any $j$.

It turns out, however, that if you follow the derivation in the XGBoost docs with this approach, and use no regularisation, the weight $w_{j}^{*}$ you should assign to the region $R_{j}$ is given by

$w_{j}^{*} = - \frac{\sum_{i \in R_{j}}\frac{\partial \ell}{\partial \hat{y}_{i}}\bigg|_{F_{t}(x_{i})}}{\epsilon \sum_{i \in R_{j}}\frac{\partial ^{2}\ell}{\partial \hat{y}_{i}^{2}}\bigg|_{F_{t}(x_{i})}}$

in which $L[F_{t}(x)+f_{t+1}(x)] = \sum_{i=1}^{N}\ell (y_{i}, \hat{y}_{i})=\sum_{i=1}^{N}\ell (y_{i}, F_{t}(x_{i}) + f_{t+1}(x_{i}))$
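To spell out the step being followed here (a sketch in the notation of the XGBoost docs, writing $g_{i}$ and $h_{i}$ for the first and second derivatives of $\ell$ with respect to $\hat{y}_{i}$, evaluated at $F_{t}(x_{i})$), the second-order expansion of the loss is

$L[F_{t}(x)+f_{t+1}(x)] \approx \sum_{i=1}^{N}\left[\ell (y_{i}, F_{t}(x_{i})) + g_{i}\,\epsilon w_{j(i)} + \tfrac{1}{2} h_{i}\,\epsilon^{2} w_{j(i)}^{2}\right]$

where $j(i)$ denotes the leaf containing $x_{i}$; setting the derivative with respect to each $w_{j}$ to zero, $\epsilon \sum_{i \in R_{j}} g_{i} + \epsilon^{2} w_{j} \sum_{i \in R_{j}} h_{i} = 0$, recovers the formula above.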

In other words, if you state that the output of each tree at each leaf will be a constant $w_{j}$ multiplied by a very small number $\epsilon$, small enough to ensure that the product is always small, then $w_{j}^{*}$ will simply compensate: the smaller you make $\epsilon$, the larger $w_{j}^{*}$ becomes, and the product remains unchanged. Crucially, the product will not necessarily be "small enough" for the Taylor series to converge quickly and justify the second-order expansion. However, if a little regularisation is used, enough to stop $w_{j}$ becoming infinite and thus ensure the product is always small, then you're good.
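A quick numeric sketch of this compensation effect, using toy values for $G_{j} = \sum_{i \in R_{j}} g_{i}$ and $H_{j} = \sum_{i \in R_{j}} h_{i}$ (hypothetical numbers, not from any real model):

```python
# Toy check: with no regularisation, the leaf output eps * w_star is
# identical for every eps, so shrinking eps alone does not shrink the
# correction. G and H are hypothetical per-leaf derivative sums.
G, H = 4.0, 2.0

def w_star(eps, lam=0.0):
    # optimal leaf weight when the leaf outputs eps * w_j (formula above)
    return -G / (lam / eps + eps * H)

for eps in [1.0, 0.1, 0.001]:
    print(eps, eps * w_star(eps))           # always -2.0 when lam == 0
    print(eps, eps * w_star(eps, lam=1.0))  # shrinks once lam > 0
```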

In essence, you have two approaches:

  1. Set $\lambda$ to be "very large"; this ensures that $w_{j}^{*}$ is small, and thus that the expansion is valid.
  2. Use a learning rate parameter $\epsilon$, together with some small amount of regularisation, to ensure that $w_{j}^{*}$ cannot be arbitrarily large.

They sound the same at a phenomenological level, but let's investigate the $w_{j}^{*}$ they imply. Using approach 1, with no learning rate, we obtain (as in the XGBoost docs, linked above)

$w_{j}^{*}= - \frac{\sum_{i \in R_{j}}\frac{\partial \ell}{\partial \hat{y}_{i}}\bigg|_{F_{t}(x_{i})}}{\lambda + \sum_{i \in R_{j}}\frac{\partial^{2} \ell}{\partial \hat{y}_{i}^{2}}\bigg|_{F_{t}(x_{i})}}$

whereas if we use a learning rate as well, we obtain

$w_{j}^{*}= - \frac{\sum_{i \in R_{j}}\frac{\partial \ell}{\partial \hat{y}_{i}}\bigg|_{F_{t}(x_{i})}}{\frac{\lambda}{\epsilon} + \epsilon \cdot \sum_{i \in R_{j}}\frac{\partial^{2} \ell}{\partial \hat{y}_{i}^{2}}\bigg|_{F_{t}(x_{i})}}$

They look very similar, and in both cases, as you increase the amount of regularisation by increasing $\lambda$, the curvature term becomes less relevant. When you have a learning rate, you can get this effect either by increasing $\lambda$ or by decreasing $\epsilon$.
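To compare the two formulas concretely, here is a toy side-by-side (again with hypothetical per-leaf derivative sums $G$ and $H$):

```python
# Toy comparison of the two leaf-weight formulas above.
G, H = 4.0, 2.0

def w_no_lr(lam):            # approach 1: regularisation only
    return -G / (lam + H)

def w_with_lr(lam, eps):     # approach 2: learning rate + regularisation
    return -G / (lam / eps + eps * H)

for lam in [0.1, 1.0, 10.0]:
    # decreasing eps shrinks the weight much like increasing lam does
    print(lam, w_no_lr(lam), w_with_lr(lam, eps=0.3))
```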

No matter which way I think about the problem, either approach seems conceptually the same, but they do give slightly different solutions. Furthermore, in practice, the learning rate is perhaps the most important hyperparameter to tune in XGBoost, although I haven't seen anybody explore whether similarly good results could be obtained by tuning the regularisation parameter more thoroughly. In particular, is there something jumping out from these two equations that I am missing?

Another Update: Another Year On

Thanks to Andreas for his answer below, which got me onto this.

Because the loss function is assumed to be approximated by a function quadratic in $w_{j}$, which is valid if $w_{j}$ is small, it will have only one minimum (assuming we're doing loss minimisation). Thus the loss evaluated at $\epsilon \cdot w^{*}_{j}$ will be greater than the loss evaluated at $w^{*}_{j}$, but less than the loss evaluated at $w_{j}=0$; in other words, by updating your prediction by $\epsilon \cdot w^{*}_{j}$, you are guaranteed to decrease your training loss. If $\epsilon$ is very small, this process happens very slowly, but if $\epsilon$ is too large, the Taylor series might not be valid. The key point here is that it's not about finding the optimal $w_{j}$; it's about finding a $w_{j}$ that guarantees the training loss decreases at every iteration.
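The guarantee is easy to see numerically for a toy quadratic model of the leaf objective (hypothetical $g$, $h$):

```python
# Toy quadratic leaf loss q(w) = g*w + 0.5*h*w**2 with hypothetical g, h.
# Any step eps * w_star with 0 < eps <= 1 lies between w = 0 and the
# minimum, so it is guaranteed to decrease the loss.
g, h = 4.0, 2.0
w_star = -g / h

def q(w):
    return g * w + 0.5 * h * w**2

for eps in [0.0, 0.1, 0.5, 1.0]:
    print(eps, q(eps * w_star))  # decreases monotonically from q(0) to q(w_star)
```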

I think the logic must go something like this, but it can't quite be this. I agree that if we know $w^{*}_{j}$, then $\epsilon w^{*}_{j}$ will also decrease the training loss, but this logic seems circular to me: if we actually knew $w^{*}_{j}$, then while we could multiply by $\epsilon$, why would we?

Conversely, if we want to find the optimal $w_{j}$ subject to the assumption that $w_{j}$ is sufficiently small, it doesn't seem correct to find the optimal $w_{j}$ under that assumption, discover that it isn't small, and then multiply it by a small number to make it small.

Best Answer

"In particular, is there something jumping out from these two equations that I am missing?"

From what I've looked at in Friedman's paper, the 'learning rate' $\epsilon$ (there called 'shrinkage' and denoted by $\nu$) is applied after choosing those weights $w_j^*$ which minimise the cost function. That is, we determine the boost's optimal weights $w_j^*$ first, and only then do we consider multiplying by $\epsilon$.

What would this mean?

This would mean that neither of the equations in the question that feature both $\epsilon$ and $w_j^*$ is used in the XGBoost algorithm.

It would also mean that $\lambda$ is still necessary in order to guarantee the validity of the Taylor expansion, and that it has a non-uniform effect on the $w_{j}$, its effect depending on the partial derivatives of $\ell$, as you wrote before: \begin{align*} w_{j}^{*}= - \frac{\sum_{i \in R_{j}}\frac{\partial \ell}{\partial \hat{y}_{i}}\bigg|_{F_{t}(x_{i})}}{\lambda + \sum_{i \in R_{j}}\frac{\partial^{2} \ell}{\partial \hat{y}_{i}^{2}}\bigg|_{F_{t}(x_{i})}} \end{align*}

The learning rate doesn't come in until after this point, when, having determined the optimal weights of the new tree $\lbrace w_j^* \rbrace_{j=1}^T$, we decide that, actually, we don't want to add what we've just deemed to be the 'optimal boost' straight up, but instead update our additive predictor $F_t$ by adding a scaled version of $f_{t+1}$: scaling each weight $w_j^*$ uniformly by $\epsilon$, and thus scaling the contribution of the whole of $f_{t+1}$ by $\epsilon$ too.
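A runnable toy sketch of that order of operations, under simplifying assumptions (squared-error loss, depth-1 stumps; this is an illustration of "fit with $\lambda$ first, scale by $\epsilon$ afterwards", not the actual XGBoost code):

```python
import numpy as np

def fit_stump(x, g, h, lam):
    """Pick the split minimising the regularised quadratic objective.
    Leaf weights use lambda but know nothing about the learning rate."""
    best = None
    for thr in np.unique(x)[:-1]:
        left = x <= thr
        right = ~left
        # w_j* = -G_j / (lam + H_j) per leaf, as in the equation above
        wl = -g[left].sum() / (lam + h[left].sum())
        wr = -g[right].sum() / (lam + h[right].sum())
        obj = (g[left].sum() * wl + 0.5 * (lam + h[left].sum()) * wl**2
               + g[right].sum() * wr + 0.5 * (lam + h[right].sum()) * wr**2)
        if best is None or obj < best[0]:
            best = (obj, thr, wl, wr)
    return best[1:]

def boost(x, y, n_rounds=50, eps=0.1, lam=1.0):
    F = np.full_like(y, y.mean())
    for _ in range(n_rounds):
        g, h = F - y, np.ones_like(y)             # derivatives of 0.5*(F - y)^2
        thr, wl, wr = fit_stump(x, g, h, lam)     # optimal weights found first...
        F = F + eps * np.where(x <= thr, wl, wr)  # ...then scaled by eps
    return F

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=200)
print("training MSE:", np.mean((boost(x, y) - y) ** 2))
```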

From where I'm sitting, there is some (weak-ish) analogy with the learning rate in gradient descent optimisation: gently aggregating the predictors in order to iterate towards what we believe a general and descriptive predictor to be, while maintaining control over how fast we get there. In contrast, a high learning rate means that we use up all of our predictive power relatively quickly. If we do so too quickly, with too few trees, then any subsequent boost might need to make large corrections, causing the loss to remain at a relatively high plateau, after a few steps of which the algorithm terminates.

Keeping a lower learning rate would aid generalisability, because we rely less upon the predictions of the new boosting tree and instead permit subsequent boosts to have more predictive power. It does mean that we need more boosts and that training takes longer to terminate, in line with the empirical results shown in @Sycorax's answer.
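For anyone wanting to probe this trade-off empirically, a minimal sketch using the xgboost Python package (the synthetic data and the particular settings are just illustrative; `learning_rate`, `n_estimators`, and `reg_lambda` are the library's parameter names):

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A lower learning rate typically needs more boosting rounds to reach a
# comparable loss, trading training time for gentler aggregation.
for lr, n_trees in [(0.3, 100), (0.05, 600)]:
    model = xgb.XGBRegressor(learning_rate=lr, n_estimators=n_trees, reg_lambda=1.0)
    model.fit(X_tr, y_tr)
    print(f"learning_rate={lr}, n_estimators={n_trees}, "
          f"test R^2={model.score(X_te, y_te):.3f}")
```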

In summary, my understanding is that:

  1. $\lambda$ is used when regularising the weights $\lbrace w_j\rbrace$ and to justify the 2nd order truncation of the loss function's Taylor expansion, enabling us to find the 'optimal' weights $\lbrace w_j^*\rbrace$. This has a non-uniform effect on each of the weights $w_j$.

  2. $\epsilon$ is used only after determination of the optimal weights $w_j^*$ and applied by scaling all of the weights uniformly to give $\epsilon\, w_j^*$.
