Solved – Understanding regularization in xgboost

boosting, cart, machine learning, regularization

A general loss function is:

$$\text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t \Omega(f_i)$$

which is prediction cost + regularization cost
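
As a concrete reading of the formula, here is a minimal sketch (toy code, not XGBoost's internals) that sums a per-example loss over the data and adds a per-tree penalty; the squared-error loss and the `omega` callable are assumptions made purely for illustration.

```python
import numpy as np

def objective(y, y_pred, trees, omega):
    """Toy obj = sum_i l(y_i, y_hat_i^(t)) + sum_t Omega(f_t),
    using squared error as the (assumed) per-example loss l."""
    prediction_cost = np.sum((y - y_pred) ** 2)               # sum_i l(y_i, y_hat_i)
    regularization_cost = sum(omega(tree) for tree in trees)  # sum_t Omega(f_t)
    return prediction_cost + regularization_cost

# Tiny illustration: each "tree" is represented here only by a precomputed penalty.
print(objective(np.array([1.0, 2.0]), np.array([0.9, 2.2]),
                trees=[0.5, 0.8], omega=lambda t: t))
```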

A decision tree is defined as:

$f_t(x) = w_{q(x)},\quad w \in \mathbb{R}^T,\quad q:\mathbb{R}^d\rightarrow \{1,2,\cdots,T\}$

Here $w$ is the vector of scores on the leaves, $q$ is a function assigning each data point to its corresponding leaf, and $T$ is the number of leaves. In XGBoost, we define the complexity as

$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$

So it seems to me that the $w$'s are the final prediction scores for each leaf made by the decision tree. Under this understanding, XGBoost is penalizing its most confident predictions, even if they are correct, as part of the regularization term in the cost function.
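
To make the notation concrete, here is a minimal sketch of a single tree written as $(q, w)$ together with its penalty; the tree, the scores, and the $\gamma$, $\lambda$ values are all illustrative assumptions (in the xgboost library they correspond to the `gamma` and `lambda` / `reg_lambda` parameters). Two trees with the same number of leaves get different penalties purely because of the magnitude of their leaf scores:

```python
import numpy as np

# Toy tree in the notation above: w is the vector of leaf scores,
# q maps a data point to a leaf index in {1, ..., T}.
w = np.array([0.1, -0.2, 0.3])                      # T = 3 leaf scores
q = lambda x: 1 if x[0] < 0 else (2 if x[1] < 0 else 3)

def f_t(x):
    """Prediction of this single tree: f_t(x) = w_{q(x)}."""
    return w[q(x) - 1]

def omega(leaf_scores, gamma=1.0, lam=1.0):
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    leaf_scores = np.asarray(leaf_scores, dtype=float)
    return gamma * leaf_scores.size + 0.5 * lam * np.sum(leaf_scores ** 2)

print(f_t(np.array([-1.0, 2.0])))        # x falls in leaf 1, score 0.1
print(omega([0.1, -0.2, 0.3]))           # small leaf scores -> small penalty
print(omega([2.0, -3.0, 4.0]))           # same T, larger (more confident) scores -> larger penalty
```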

I am not sure this even qualifies as a question; I am mostly looking for someone to tell me whether I am reading this right, since I have never seen the confidence of a model being penalized under regularization.

Also, aren't the two parts of the cost function contradictory in some sense, with one part trying to be more confident and the other part (the regularization part) trying to be less confident?

Best Answer

The contradiction you noticed is precisely the idea of regularization: you want to exchange confidence on the training set for confidence on the test set, since being too confident on the training set does not imply the model will generalize well; indeed, you could simply be fitting noise in the training set. When you penalize the weights, you usually end up with a simpler model (imagine that some weights may shrink to zero) that may perform worse on the training set but performs better on unseen data.
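
As a small, hedged illustration of that trade-off using the xgboost Python package (the synthetic dataset and the $\lambda$ values are assumptions made just for the example): increasing `reg_lambda` shrinks the leaf scores, so the training error usually does not drop as low, and on noisy data the held-out error often improves.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy synthetic data: a model that is too confident on the training set
# is largely fitting this noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 5))
y = np.sin(X[:, 0]) + 0.5 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for lam in [0.0, 1.0, 10.0]:             # illustrative values, not tuned
    model = xgb.XGBRegressor(n_estimators=200, max_depth=4,
                             reg_lambda=lam, gamma=0.0)
    model.fit(X_tr, y_tr)
    print(f"lambda={lam:5.1f}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")
```

With the penalty turned up, the fitted leaf weights are smaller, which is exactly the "less confident on the training set" behaviour the question describes.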