Solved – GBM: impact of the loss function

boosting

I'm familiar with Random Forest and AdaBoost, and if I understand correctly, the advantage of GBM over these algorithms is that we can use different loss functions.

Is it correct to say:

  1. Using square loss means building trees on the raw residuals of the previous tree.

  2. Using absolute loss means replacing the residuals by their signs, i.e. -1 for negative residuals and 1 for positive residuals, and building the next tree on these transformed residuals.
    This reduces the impact of outliers.

If these statements are correct, don't you think it's misleading to use the term square loss to describe something that just uses the raw residuals? For instance, in OLS regression "square" means what we expect: the residuals really are squared.

Best Answer

Using square loss means building trees on the raw residuals of the previous tree.

It is fair, but given your last statement

don't you think it's misleading to use the term square loss to describe something that just uses the raw residuals?

I fear you're missing an important point about gradient boosting. In gradient boosting the trees (or whatever weak learners) are not fit to the loss function (the function being minimized); they are fit to the negative gradient of the loss function with respect to the prediction, often called the pseudo-residuals (*).

That's hard to digest, so here's how it works out in the squared error case. Here the loss function for a single data point is

$$ L(y, \hat y) = (y - \hat y)^2 $$

The $\hat y$ is what I'm calling the prediction. If we treat it like a formal variable and differentiate, we get

$$ \nabla L (y, \hat y) = -2(y - \hat y) $$

so the negative gradient, which is the quantity the trees are actually fit to, is $2(y - \hat y)$: up to the constant factor this is just the raw residual $y - \hat y$, which is why your first statement is fair. We plug the appropriate values of $y$ and $\hat y$ in, and the result is our working response for one round of boosting. In the initial boosting iteration we initialize $\hat y$ to a constant (here the mean of the response), and in subsequent iterations to the prediction of the previous rounds of boosting.
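To make this concrete, here is a minimal sketch of a single boosting round under squared error, using numpy and a scikit-learn regression tree; the data, variable names, and hyperparameters are purely illustrative, not taken from any library's internals.

```python
# Minimal sketch of one boosting round with squared-error loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Initialize the prediction to a constant: the mean of the response.
y_hat = np.full_like(y, y.mean())

# Gradient of L(y, y_hat) = (y - y_hat)^2 with respect to the prediction.
gradient = -2.0 * (y - y_hat)

# The tree is fit to the negative gradient, which (up to the factor 2)
# is just the raw residual y - y_hat.
working_response = -gradient
tree = DecisionTreeRegressor(max_depth=2).fit(X, working_response)

# The fitted tree supplies the next step; a small learning rate plays
# the role of the step size in gradient descent.
learning_rate = 0.1
y_hat = y_hat + learning_rate * tree.predict(X)
```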

Using absolute loss means replacing residuals by their signs, i.e. -1 for negative residuals and 1 for positive residuals, and building the next tree on these transformed residuals.

This is mostly correct, but it's good to interpret it in light of the above. For absolute loss we are minimizing the absolute value loss function

$$ L(y, \hat y) = \left| y - \hat y \right| $$

so the trees are fit to the negative gradient

$$ -\nabla L (y, \hat y) = \begin{cases} -1 & \text{if} \ y - \hat y < 0 \\ +1 & \text{if} \ y - \hat y \geq 0 \end{cases} $$

Since the magnitude of the update at each stage of boosting is controlled by this working response, you can see that in the squared error case larger residuals give larger updates, while in the absolute error case the update scale is fixed no matter how large the residual is. This is what reduces the impact of outliers.
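A tiny numeric sketch (plain numpy, made-up numbers) shows the difference in the working responses when one observation is an outlier:

```python
# Contrast the working responses (negative gradients) of squared and
# absolute loss on toy data containing one large outlier.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 100.0])   # the last point is an outlier
y_hat = np.full_like(y, y.mean())      # constant initial prediction (26.5)

# Squared error: the working response is proportional to the raw residual,
# so the outlier dominates what the next tree tries to fit.
squared_response = y - y_hat
print(squared_response)                 # [-25.5 -24.5 -23.5  73.5]

# Absolute error: the working response is only the sign of the residual,
# so the outlier carries no more weight than any other point.
absolute_response = np.sign(y - y_hat)
print(absolute_response)                # [-1. -1. -1.  1.]
```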

(*) These trees are then used to define, essentially, a direction to move in when minimizing the loss function, à la gradient descent.
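To spell that footnote out, with $h_m$ denoting the tree fit in round $m$, $\eta$ a small learning rate, and $\hat y^{(m)}_i$ the prediction for point $i$ after $m$ rounds (notation mine, not from the original post), each round updates the prediction by stepping along the direction the tree provides:

$$ \hat y^{(m+1)}_i = \hat y^{(m)}_i + \eta \, h_m(x_i) $$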

Why is it common to mention the loss function rather than the associated gradient when describing a GBM? That would avoid this confusion.

Because the gradient part is an implementation detail. It's an important one, but still an internal part of the algorithm that does not need to be exposed as part of an interface.
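As an illustration of that interface/implementation split, scikit-learn's GradientBoostingRegressor lets the user pick the loss by name while the gradient computation stays internal (the loss strings below are the ones used in recent scikit-learn releases; older versions used different names):

```python
# The user names the loss; the corresponding gradient lives inside fit().
from sklearn.ensemble import GradientBoostingRegressor

model_l2 = GradientBoostingRegressor(loss="squared_error")   # square loss
model_l1 = GradientBoostingRegressor(loss="absolute_error")  # absolute loss
# Nowhere does the caller supply d(loss)/d(prediction).
```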

The loss function is what is being minimized, while the gradient is how it is being minimized. The first thing is much more important: it needs to be communicated to everyone involved with a model, even the manager of the non-technical department who is still not convinced that $(x + y)^2 \neq x^2 + y^2$.

But, at the end of the day, it's a matter of convention. We tend to name models in accord with the loss function being minimized (absolute error, square error) or an associated distribution (the Poisson model minimizes the Poisson deviance loss function). I'd like to think there are good reasons to do so, but even if you disagree, a good reason to use the traditional names is that people will understand you better that way.