Solved – How do Newton-Raphson updates work in Gradient Boosting

boosting, gradient descent

I want to use the gradient boosting algorithm with the exponential loss function, and I am struggling to understand how the Newton-Raphson update step for the predictions works. In Python's scikit-learn GradientBoostingClassifier, the update step is the following:

    # y_ holds the class labels encoded as -1/+1; the sums run over the samples in this leaf
    numerator = np.sum(y_ * sample_weight * np.exp(-y_ * pred))
    denominator = np.sum(sample_weight * np.exp(-y_ * pred))

    # prevents overflow and division by zero
    if abs(denominator) < 1e-150:
        tree.value[leaf, 0, 0] = 0.0
    else:
        tree.value[leaf, 0, 0] = numerator / denominator

The numerator is the negative sum of the partial first derivatives of the exponential loss function
\begin{align}
L(\mathrm{pred}) &= \sum_i e^{-y_i \cdot \mathrm{pred}_i}, \text{ and } \\
\text{numerator} &= \sum_i \left(-\frac{\partial}{\partial\, \mathrm{pred}_i} e^{-y_i \cdot \mathrm{pred}_i}\right) = \sum_i y_i \cdot e^{-y_i \cdot \mathrm{pred}_i}.
\end{align}
The denominator is the sum of the partial second derivatives of the exponential loss function
\begin{align}
\text{denominator} = \sum_i \frac{\partial^2}{\partial\, \mathrm{pred}_i^2} e^{-y_i \cdot \mathrm{pred}_i} = \sum_i y_i^2 \cdot e^{-y_i \cdot \mathrm{pred}_i} = \sum_i e^{-y_i \cdot \mathrm{pred}_i}, \text{ since } y_i^2 = 1.
\end{align}
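
To check this correspondence numerically, here is a small self-contained sketch (my own illustration, not scikit-learn code) that assumes labels encoded as ±1 and unit sample weights, and compares the two sums from the snippet above with the summed first and second derivatives of the exponential loss:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.choice([-1.0, 1.0], size=8)       # labels encoded as -1/+1
    pred = rng.normal(size=8)                 # current predictions for the samples in one leaf
    sample_weight = np.ones(8)                # unit weights for simplicity

    # per-sample first and second derivatives of L = sum(w * exp(-y * pred))
    grad = -y * sample_weight * np.exp(-y * pred)     # dL/dpred_i
    hess = y**2 * sample_weight * np.exp(-y * pred)   # d^2 L/dpred_i^2

    # the two sums from the scikit-learn snippet above
    numerator = np.sum(y * sample_weight * np.exp(-y * pred))
    denominator = np.sum(sample_weight * np.exp(-y * pred))

    print(np.isclose(numerator, -grad.sum()))    # True: numerator = minus the summed gradient
    print(np.isclose(denominator, hess.sum()))   # True: denominator = summed second derivatives
    print(numerator / denominator)               # the ratio stored as the leaf value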

But according to the Newton-Raphson algorithm, the update of pred should be

\begin{align}
\mathrm{pred} = \mathrm{pred} - H\big(L(\mathrm{pred})\big)^{-1} \cdot \nabla L(\mathrm{pred}),
\end{align}

where $H$ denotes the Hessian, i.e. the matrix of second-order partial derivatives.
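
For comparison, here is what the update in that matrix form looks like for the exponential loss on a tiny example (an illustrative sketch, not library code; labels are again encoded as ±1). Because each term of the loss depends only on its own prediction, the off-diagonal second derivatives are zero, so the matrix solve reduces to an element-wise ratio:

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.choice([-1.0, 1.0], size=5)
    pred = np.zeros(5)                          # start from zero predictions

    # gradient and Hessian of L(pred) = sum_i exp(-y_i * pred_i)
    grad = -y * np.exp(-y * pred)               # shape (5,)
    hess = np.diag(y**2 * np.exp(-y * pred))    # cross-derivatives vanish, so H is diagonal

    # textbook Newton-Raphson update: pred - H^{-1} * grad
    pred_new = pred - np.linalg.solve(hess, grad)

    # with a diagonal Hessian the matrix solve is just an element-wise ratio
    print(np.allclose(pred_new, pred - grad / np.diag(hess)))   # True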

Why does the scikit-learn code sum over the gradient entries and over the Hessian entries and then take the ratio of the two sums as the update step for the predictions?

As a reference for the Newton-Raphson algorithm, I've been reading the following links:
https://www.stat.washington.edu/adobra/classes/536/Files/week1/newtonfull.pdf
https://math.stackexchange.com/questions/1153655/newtons-method-vs-gradient-descent-with-exact-line-search

Best Answer

Tree-boosting builds ensembles of trees in an iterative way. In every iteration, a new tree is added to the ensemble, conditional on the current ensemble of trees, in such a way that the empirical risk (a.k.a. the training loss) is minimized. Since this minimization usually cannot be done in closed form, an approximate minimization is done. Broadly speaking, a new tree is found using either functional gradient descent or a functional version of Newton's method, which is conceptually similar to Newton's method in finite dimensions (https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization). Specifically, for Newton boosting, the empirical risk is replaced by a (functional) second-order Taylor approximation, and a tree is found by minimizing this approximation. This minimization can be done using weighted least squares. Note that for standard regression with squared error loss, gradient and Newton boosting are equivalent.
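
To make the weighted-least-squares step concrete, here is a minimal sketch of one Newton boosting iteration for the exponential loss, assuming labels encoded as ±1. It is my own illustration based on the description above (the function name newton_boosting_step is made up), not code from the preprint or from scikit-learn. The tree is fit to the per-sample ratios -g/h with the second derivatives h as sample weights, which minimizes the second-order Taylor approximation of the empirical risk:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def newton_boosting_step(X, y, pred, learning_rate=0.1, max_depth=3):
        """One Newton boosting iteration for the exponential loss, y in {-1, +1}."""
        # per-sample first and second derivatives of exp(-y * pred)
        g = -y * np.exp(-y * pred)          # gradients
        h = np.exp(-y * pred)               # second derivatives (y**2 == 1)

        # weighted least squares: fit a tree to the per-sample Newton steps -g/h,
        # using the second derivatives h as sample weights
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, -g / h, sample_weight=h)

        # each leaf's fitted value is sum(-g)/sum(h) over its samples,
        # i.e. exactly the numerator/denominator ratio from the question
        return pred + learning_rate * tree.predict(X)

    # toy usage
    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))
    y = np.where(X[:, 0] + rng.normal(scale=0.5, size=100) > 0, 1.0, -1.0)
    pred = np.zeros(100)
    for _ in range(10):
        pred = newton_boosting_step(X, y, pred)
    print(np.mean(np.exp(-y * pred)))       # mean training exponential loss after 10 rounds

For standard regression with squared error loss, the second derivatives are constant, so the weighted least-squares fit reduces to an ordinary least-squares fit to the negative gradients; this is why gradient and Newton boosting coincide in that case.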

For more details, this preprint https://arxiv.org/abs/1808.03064 explains the difference between Newton and gradient boosting.

Disclaimer: I am the author of the article.
