Solved – XGBoost empirical Hessian of data points for squared loss function

boosting, regularization

I would like to understand XGBoost in detail.
In Figure 2 of the paper, 5 instances are displayed together with their gradient and their Hessian.

Assuming I want to find the optimal split in the current tree, I need to compute the gradient and the Hessian of each observation.
1) Assuming a squared loss function, how can I do this?

As far as I understand, Gradient Boosting fits the next base learner (a regression tree in my case) to the pseudo-residuals. These are calculated from the current boosting estimate at step m, i.e. from f_{m-1}(x).
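If I write this out for a squared loss $L(y_i, f(x_i)) = \frac{1}{2}\left(y_i - f(x_i)\right)^2$ (my own calculation, so please correct me if it is wrong), the pseudo-residual at step $m$ would be

$$r_{im} = -\left.\frac{\partial L\left(y_i, f(x_i)\right)}{\partial f(x_i)}\right|_{f = f_{m-1}} = y_i - f_{m-1}(x_i).$$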

Now, in the case of XGBoost I do not only need the gradient, which I think corresponds to the pseudo-residuals, but also the Hessian. So what is the Hessian in this case? Is it the sign of the pseudo-residuals?

2) I tried to visualise my computations. In particular, I tried to use the scoring function provided in the original paper to evaluate a possible split, but I cannot figure out what is wrong with my calculations. The first split results in a loss of zero, which seems wrong, since it does not reflect a perfect fit. (Do I really have to evaluate the gradients at each x_i first, before taking their sum?)

As far as I understand, I have to apply the scoring function (equation (7) in the original paper) to every possible split. In my case I assume that no trees have been grown yet, which means that I am in iteration m=1, and that at m=0 the mean of the target is the initial guess.
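For reference, the scoring function I mean is the split-gain formula (equation (7) in the paper):

$$\mathcal{L}_{\text{split}} = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma,$$

where $I_L$ and $I_R$ are the instance sets of the left and right child and $I = I_L \cup I_R$.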

My example was supposed to show the evaluation of the first possible split, namely splitting the first variable at its first possible split point. I know that there are two further splits for X1 and three splits for X2, so 6 splits in total, and I would have to evaluate all of them with the scoring function.

However, how can I evaluate the first possible split of the first split variable with the scoring function? I know that the scoring function yields the loss reduction, so I would calculate it for every possible split and split the tree at the variable and split point maximising the score. I guess my calculation is not correct yet, so I would really appreciate help!

Thanks in advance!

[Image: calculation of the initial guess]

[Image: evaluation of split 1 of variable X1]

Edit: the paper is XGBoost: A Scalable Tree Boosting System (Chen & Guestrin, 2016).

Best Answer

I think you're getting confused. In the paper XGBoost: A Scalable Tree Boosting System, the calculations you are referring to are used to compute the optimal leaf weight and to decide whether to make a split in the first place. The gradient and Hessian here are the first and second derivatives of the loss function with respect to the current prediction. For squared loss $l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$, this gives $g_i = \hat{y}_i^{(t-1)} - y_i$ (the negative of the pseudo-residual) and $h_i = 1$ for every sample, so the Hessian is a constant, not the sign of the pseudo-residuals. Explicitly, the current prediction $\hat{y}_i^{(t-1)}$ sums over all trees built so far, and the statistics in the score sum the per-sample $g_i$ and $h_i$ over all samples in the node. In practice you would likely just use a batch of samples.
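To make this concrete, here is a minimal sketch in Python (the toy data and the settings $\lambda = 1$, $\gamma = 0$ are made up for illustration, not taken from your example):

```python
import numpy as np

def grad_hess_squared_loss(y, y_pred):
    """Derivatives of l(y, y_hat) = 1/2 * (y - y_hat)^2 w.r.t. y_hat:
    g_i = y_hat_i - y_i, h_i = 1 (constant) for every sample."""
    g = y_pred - y
    h = np.ones_like(y)
    return g, h

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Equation (7) of Chen & Guestrin (2016): loss reduction of a split,
    with L2 regularisation lam and complexity penalty gamma."""
    G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
    G_R, H_R = g[~left_mask].sum(), h[~left_mask].sum()
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (G_L**2 / (H_L + lam)
                  + G_R**2 / (H_R + lam)
                  - G**2 / (H + lam)) - gamma

# Toy example: 5 observations, initial prediction = mean of the target.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x1 = np.array([0.1, 0.4, 0.5, 0.8, 0.9])
y_pred = np.full_like(y, y.mean())

g, h = grad_hess_squared_loss(y, y_pred)

# Evaluate every possible split point on x1; pick the one with max gain.
for threshold in np.unique(x1)[:-1]:
    gain = split_gain(g, h, x1 <= threshold)
    print(f"x1 <= {threshold:.1f}: gain = {gain:.3f}")
```

Note that the statistics $G_L$, $H_L$, $G_R$, $H_R$ are sums of the per-sample $g_i$ and $h_i$, so yes: you evaluate the gradient at each $x_i$ first and then sum within each child.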
