Gradient Boosting – Understanding Gradient in Gradient Boosting

boosting, gradient

I know the basic overview of how gradient boosting trees work, but I am finding it hard to figure out the use of the gradient in gradient boosting. My question may seem stupid, but it would be great if anyone could help me out! So my question is:

What exactly is the use of the gradient in gradient boosting?

P.S.: It would be even more helpful if someone could explain it to me with less math.

Best Answer

In short, the gradient here refers to the gradient of the loss function, and it is the target value for each new tree to predict.

Suppose you have a true value $y$ and a predicted value $\hat{y}$, where the predicted value is constructed from the trees built so far. You are now trying to construct the next tree, which gives a prediction $z$, so your final prediction will be $\hat{y}+z$. The ideal choice is $z = y - \hat{y}$, the residual; therefore, you are constructing trees to predict $y - \hat{y}$.
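For concreteness, here is a minimal sketch of that single step in Python (assuming scikit-learn's `DecisionTreeRegressor` and a made-up toy regression dataset; the variable names are illustrative, not any library's actual implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Existing prediction y_hat from the trees built so far
# (here just the mean, standing in for a "zeroth" tree).
y_hat = np.full_like(y, y.mean())

# The next tree is trained to predict the residual y - y_hat ...
residual = y - y_hat
tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
z = tree.predict(X)

# ... and the updated prediction is y_hat + z.
y_hat_new = y_hat + z
print("loss before:", 0.5 * np.mean((y - y_hat) ** 2))
print("loss after: ", 0.5 * np.mean((y - y_hat_new) ** 2))
```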

It turns out this is a special case of gradient boosting where your loss function is $L = \frac{1}{2} (y - \hat{y})^2$, and the prediction target for the new tree is the negative gradient of this loss function, since $y - \hat{y} = - \frac{\partial L}{\partial \hat{y}}$.
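A quick numerical check of that identity (a sketch with made-up values): for the squared loss, the negative gradient computed element-wise is exactly the residual, which is why fitting residuals is a special case of fitting gradients.

```python
import numpy as np

y = np.array([3.0, -1.0, 2.5])
y_hat = np.array([2.0, 0.0, 2.0])

# L = 0.5 * (y - y_hat)^2, so dL/dy_hat = -(y - y_hat)
grad = -(y - y_hat)
residual = y - y_hat

print(-grad)     # [ 1.  -1.   0.5]
print(residual)  # [ 1.  -1.   0.5]  -- identical: negative gradient == residual
```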

More formally, if you already have a prediction $\hat{y}$ and you are trying to add a new prediction $z$ from a new tree to it, the loss function can be expanded by a first-order Taylor expansion around $\hat{y}$: $$L(\hat{y} + z) \approx L(\hat{y}) + \frac{\partial L}{\partial \hat{y}} z.$$ In the spirit of gradient descent, we want $z$ to point along the negative gradient direction, hence $z \sim - \frac{\partial L}{\partial \hat{y}}$.
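To see why that choice helps, plug $z = -\rho\,\frac{\partial L}{\partial \hat{y}}$ into the expansion, where $\rho > 0$ is a small step size (the learning rate; the symbol is just for illustration). This is the standard gradient-descent argument written out:

$$L(\hat{y} + z) \approx L(\hat{y}) - \rho \left(\frac{\partial L}{\partial \hat{y}}\right)^{2} \le L(\hat{y}),$$

so, to first order, the step cannot increase the loss.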

In that way, you set the target response to be predicted by the new tree. All that is left to do is construct a tree whose output on each input data point approximates the negative gradient.
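Putting it all together, here is a minimal sketch of the whole procedure (assuming scikit-learn trees, squared loss, and an illustrative learning rate of 0.1; a real library adds shrinkage schedules, leaf-value optimization, subsampling, and so on):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def neg_gradient(y, y_hat):
    # For L = 0.5 * (y - y_hat)^2 the negative gradient is the residual;
    # swapping in another differentiable loss only changes this function.
    return y - y_hat

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1  # illustrative value
n_trees = 100

y_hat = np.full_like(y, y.mean())  # initial prediction
trees = []

for _ in range(n_trees):
    # Each new tree is fit so its output approximates the negative gradient.
    target = neg_gradient(y, y_hat)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, target)
    trees.append(tree)
    y_hat = y_hat + learning_rate * tree.predict(X)

print("final training loss:", 0.5 * np.mean((y - y_hat) ** 2))
```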