Solved – Explanation of min_child_weight in xgboost algorithm

Tags: boosting, hessian, machine learning

The definition of the min_child_weight parameter in xgboost is given as the:

minimum sum of instance weight (hessian) needed in a child. If the
tree partition step results in a leaf node with the sum of instance
weight less than min_child_weight, then the building process will give
up further partitioning. In linear regression mode, this simply
corresponds to minimum number of instances needed to be in each node.
The larger, the more conservative the algorithm will be.

I have read quite a few things on xgboost, including the original paper (see formula 8 and the one just after equation 9), this question, and most things to do with xgboost that appear on the first few pages of a Google search. 😉

Basically I'm still not clear on why we impose a constraint on the sum of the hessian. My only thought at the moment, from the original paper, is that it relates to the weighted quantile sketch section (and the reformulation of equation 3 as a weighted squared loss), which has $h_i$ as the 'weight' of each instance.
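For reference, my reading of that reformulation is just completing the square in the second-order objective, which turns it into a squared loss with per-instance weight $h_i$:

$$\sum_{i=1}^n \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] = \sum_{i=1}^n \tfrac{1}{2} h_i \left( f_t(x_i) + \tfrac{g_i}{h_i} \right)^2 + \text{constant}.$$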

A further question is why this simply becomes the number of instances in linear regression mode. I guess this is related to the second derivative of the sum-of-squares loss?

Best Answer

For a regression, the loss of each point in a node is

$\frac{1}{2}(y_i - \hat{y_i})^2$

The second derivative of this expression with respect to $\hat{y_i}$ is $1$. So when you sum the second derivative over all points in the node, you get the number of points in the node. Here, min_child_weight means something like "stop trying to split once your sample size in a node goes below a given threshold".
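To make that concrete, here is a small illustrative sketch (not xgboost's internal code; the numbers are made up) computing the per-point gradient and hessian of the squared-error loss. The hessian is identically 1, so summing it over a node just counts the points in the node:

```python
import numpy as np

# Squared-error loss for one point: 0.5 * (y - y_hat)^2
#   gradient w.r.t. y_hat: (y_hat - y)
#   hessian  w.r.t. y_hat: 1  (constant, independent of the prediction)

y     = np.array([3.0, -1.5, 2.2, 0.7])   # targets of the points in a node
y_hat = np.array([2.5, -1.0, 2.0, 1.0])   # current predictions for those points

grad = y_hat - y
hess = np.ones_like(y_hat)                # second derivative is 1 for every point

# The "instance weight" xgboost sums per node is the hessian, so here the
# sum is simply the node's sample size:
print(hess.sum())   # 4.0 == number of points in the node
```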

For a binary logistic regression, the hessian for each point in a node is going to contain terms like

$\sigma(\hat{y_i})(1 - \sigma(\hat{y_i}))$

where $\sigma$ is the sigmoid function. Say you're at a pure node (e.g., all of the training examples in the node are 1's). Then all of the $\hat{y_i}$'s will probably be large positive numbers, so all of the $\sigma(\hat{y_i})$'s will be near 1, so all of the hessian terms will be near 0. Similar logic holds if all of the training examples in the node are 0. Here, min_child_weight means something like "stop trying to split once you reach a certain degree of purity in a node and your model can fit it".
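Again as an illustrative sketch (the raw scores below are invented for the example), you can see the per-point hessian of the logistic loss collapse toward zero once the node is pure and the model is confident about it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Raw scores (log-odds) for points in a node where all labels are 1 and the
# model already fits them: large positive scores.
y_hat = np.array([3.0, 4.5, 5.0, 6.2])

p = sigmoid(y_hat)
hess = p * (1.0 - p)          # per-point hessian of the logistic loss

print(hess)        # every term is close to 0
print(hess.sum())  # tiny, so this node easily fails a min_child_weight check
```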

The Hessian's a sane thing to use for regularization and limiting tree depth. For regression, it's easy to see how you might overfit if you're always splitting down to nodes with, say, just 1 observation. Similarly, for classification, it's easy to see how you might overfit if you insist on splitting until each node is pure.
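If you want to see the effect in practice, here is a minimal sketch using xgboost's scikit-learn wrapper on synthetic data (the dataset and parameter values are arbitrary, chosen only for illustration). With squared error, raising min_child_weight behaves like a minimum samples-per-leaf constraint and makes the trees more conservative:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.1 * rng.normal(size=500)   # simple synthetic regression target

for mcw in (1, 10, 100):
    model = xgb.XGBRegressor(min_child_weight=mcw, max_depth=6, n_estimators=50)
    model.fit(X, y)
    # With squared-error loss the hessian sum is the node size, so larger
    # min_child_weight forces bigger leaves and a less overfit training fit.
    print(mcw, model.score(X, y))
```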
