Solved – Understanding weak learner splitting criterion in gradient boosting decision tree (lightgbm) paper

boosting, cart, gradient, loss-functions, machine-learning

I'm trying to understand the description of gradient boosting in the LightGBM paper, shown in the picture below.

(Link to paper: https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf).

In particular, my question refers to the formula in the definition. In gradient boosting with trees, at each iteration we compute the gradient values of our loss function and then fit a regression tree to these gradients (the 'weak learner').

The formula gives a specific splitting criterion used while building one of these intermediate trees. Additionally, in line 6 the authors mention that this splitting criterion is the one usually used in gradient boosting.

I wonder, where does the formula come from?

[Image: the definition of the splitting criterion (variance gain) from the LightGBM paper]

Best Answer

After getting some help from the authors, I can now write down how I understand it. Please jump in if there is disagreement.

Say we have some differentiable loss function $L(y, H(x))$, where $H(x)$ is our tree ensemble at some iteration. Let $g_i$ be the gradient of the loss function with respect to the current prediction for observation $i$.

In each iteration, the gradient is our new label vector, on which we fit a regression tree; that is, $\tilde{y}_i := g_i$.
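For concreteness, here is a minimal sketch (my own toy example, not code from the paper) of one such iteration for squared loss, where the gradient vector becomes the label for a regression tree:

```python
# One gradient boosting iteration on a hypothetical toy dataset, following the
# convention above of fitting the weak learner to the raw gradients.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + 0.1 * rng.normal(size=100)

H = np.zeros(100)            # current ensemble prediction H(x)
g = 2 * (H - y)              # gradient of the squared loss (H - y)^2 w.r.t. H

tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, g)               # weak learner fitted to the gradient "labels"

H = H - 0.1 * tree.predict(X)   # step against the gradient (learning rate 0.1)
```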

Let's consider only the gradient instances belonging to some parent node at some iteration, so when I write $\forall g_i$ I mean all the instances in this parent node. Let $L = \left\{ g_i \mid x_{i,s} \leq d \right\}$ and define $R$ analogously (the instances with $x_{i,s} > d$). We then search for the best splitting variable $s$ and split point $d$ for the next split.

Therefore, we choose $s$ and $d$ according to

$ \min_{s,d} \sum_{g_i \in L}(g_i - \bar{g}_L)^2 + \sum_{g_i \in R}(g_i - \bar{g}_R)^2 - \sum_{\forall g_i }(g_i - \bar{g})^2 \\ \quad = \sum_{g_i \in L}g_i^2 - n_L \cdot \bar{g}_L^2 + \sum_{g_i \in R}g_i^2 - n_R \cdot \bar{g}_R^2 - \Big(\sum_{\forall g_i}g_i^2 - n \cdot \bar{g}^2\Big) $

(as $\sum_{g_i \in L}^{}g_i^2 + \sum_{g_i \in R}^{}g_i^2 = \sum_{\forall g_i}^{}g_i^2 $, these terms cancel out)

$\quad = - n_L \cdot \bar{g}_L^2 - n_R \cdot \bar{g}_R^2 + n \cdot \bar{g}^2 $

Now, $n \cdot \bar{g}^2$ is always the same, independent of how we make the split. Hence, for the minimization we can ignore it. Therefore, the minimization from the first line is equivalent to:

$ \min_{s,d}\quad - n_L \cdot \bar{g}_L^2 - n_R \cdot \bar{g}_R^2 $,

which is equivalent to

$ \max_{s,d} \quad n_L \cdot \bar{g}_L^2 + n_R \cdot \bar{g}_R^2 \\ \quad \quad = n_L \cdot \Big(\frac{1}{n_L}\sum_{g_i \in L}g_i\Big)^2 + n_R \cdot \Big(\frac{1}{n_R}\sum_{g_i \in R}g_i\Big)^2 \\ \quad\quad = n_L \cdot \frac{1}{n_L^2} \Big(\sum_{g_i \in L}g_i\Big)^2 + n_R \cdot \frac{1}{n_R^2} \Big(\sum_{g_i \in R}g_i\Big)^2 \\ \quad\quad = \frac{(\sum_{g_i \in L}g_i)^2}{n_L} + \frac{(\sum_{g_i \in R}g_i)^2}{n_R} $
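As a sanity check, here is a small numerical sketch (my own, not from the paper) showing that minimizing the children's squared errors and maximizing the last expression pick the same split point $d$ on a single feature:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.normal(size=50))      # one feature, sorted to form candidate splits
g = rng.normal(size=50)               # gradient values of the parent-node instances

def sse(v):
    """Sum of squared deviations from the mean (0 for an empty array)."""
    return float(np.sum((v - v.mean()) ** 2)) if v.size else 0.0

cands = (x[:-1] + x[1:]) / 2          # candidate split points d between neighbours
sse_obj  = [sse(g[x <= d]) + sse(g[x > d]) for d in cands]
gain_obj = [g[x <= d].sum() ** 2 / (x <= d).sum()
            + g[x > d].sum() ** 2 / (x > d).sum() for d in cands]

# Both criteria select the same split point d.
assert np.argmin(sse_obj) == np.argmax(gain_obj)
```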

This is almost exactly the formula from the picture, except that they additionally weight it by the overall number of instances in the parent node. I assume this is done to make splits in different nodes comparable, because they grow the tree best-first (leaf-wise).
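Under that reading, the gain I would compute looks like the following (a hypothetical helper of mine, not LightGBM's actual implementation): the quantity derived above, divided by the parent node size $n$, which puts splits of differently sized leaves on the same scale when deciding which leaf to split next:

```python
def variance_gain(g_left, g_right):
    """Split gain as I read the paper's definition: the maximized quantity
    from the derivation above, normalized by the parent node size n."""
    n_l, n_r = len(g_left), len(g_right)
    n = n_l + n_r
    return (sum(g_left) ** 2 / n_l + sum(g_right) ** 2 / n_r) / n
```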
