Solved – What are the final predictions in tree-based models?

boosting, cart, machine-learning

I always thought there were only two ways the prediction at a final leaf is obtained in a tree-based model:

  1. For regression problems, take the average of the continuous target variable over all samples in the node.
  2. For classification problems, take the mode of the labels in the node as the final prediction (a short sketch of both rules follows this list).
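
To make those two rules concrete, here is a minimal sketch in plain Python (the leaf contents are made up, purely for illustration):

```python
from statistics import mean, mode

# Hypothetical leaf contents, just to illustrate the two rules above.
regression_leaf = [0.1, 0.2, 0.3, 100]   # continuous targets that fell into one leaf
classification_leaf = [1, 1, 1, 0, 0]    # class labels that fell into one leaf

print(mean(regression_leaf))      # 25.15 -- the leaf's prediction is the average target
print(mode(classification_leaf))  # 1     -- the leaf's prediction is the most common label
```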

Then recently I started reading the original XGBoost paper by the authors of the algorithm, and I realized I didn't quite understand how it is done. I have two reasons to believe my understanding is not entirely right. Here is the figure explaining CART from the paper:

[Figure from the paper: the CART example predicting whether each person likes computer games, with a score shown under every leaf.]

  1. As can be seen, this is a classification problem (does the person like computer games?), so a leaf output of 2 does not make sense to me.

  2. The paper also says $\lambda\|w\|^2$ is added as a regularization term in the cost function, where $w$ is the vector of terminal-node weights (which I understand to be the final predictions of the nodes in a regression tree). In that case this particular regularization term does not make much sense to me, as it only penalizes large predictions rather than making the model more or less complex.

How are predictions made in CART? What is the final score of a leaf, and how is it obtained?

Here is an example.

Three final leaves (classification):

first node: (1,1,1,0,0),
second node: (1,1),
third node: (1,1,1,1)

Three final leaves (regression):

first node: (0.1,0.2,0.3,100),
second node: (4,4,4),
third node: (50)

What would the score and $w$ (the quantity referred to in the paper's regularization term) be for each node in both cases?

Best Answer

As a simplified abstraction, you can think of XGBoost as a special kind of logistic regression, where the "features" are boolean indicators of membership in the terminal nodes of trees. I elaborate on this more in How does gradient boosting calculate probability estimates? If you adopt this perspective of a boolean feature matrix given by some decision tree, then you need to figure out how to set the weights $w$ (i.e. the coefficient vector).
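
To make that abstraction concrete, here is a minimal sketch (scikit-learn on made-up data; it illustrates the idea rather than how XGBoost is actually trained): fit one decision tree, turn its terminal-node assignments into Boolean indicator features, and fit an L2-penalized logistic regression on those indicators.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# A single decision tree partitions the feature space into terminal nodes.
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Boolean "features": one indicator column per terminal node (leaf).
leaf_ids = tree.apply(X)
leaves = np.unique(leaf_ids)
Z = (leaf_ids[:, None] == leaves[None, :]).astype(float)

# An L2-penalized logistic regression on the indicators; its coefficients play
# the role of the leaf weights w, and the penalty is the analogue of lambda*||w||^2.
clf = LogisticRegression(penalty="l2", C=1.0, fit_intercept=False).fit(Z, y)
print(clf.coef_)  # one fitted weight per leaf
```

The point of the sketch is only the shape of the problem: once the partition (the tree structure) is fixed, what remains is choosing one coefficient per leaf under an L2 penalty.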

In the section "Model Complexity," the author writes

Here $w$ is the vector of scores on leaves...

The score measures the weight of the leaf. See the diagram in the "Tree Ensemble" section; the author labels the number below the leaf as the "score."

[Figure: the tree-ensemble diagram from the documentation, with the score printed below each leaf.]

The score is also defined more precisely in the paragraph preceding your expression for $\Omega(f)$:

We need to define the complexity of the tree $\Omega(f)$. In order to do so, let us first refine the definition of the tree $f(x)$ as $$f_t(x)=w_{q(x)}, \quad w \in R^T, \quad q: R^d \to \{1,2,\dots,T\}.$$ Here $w$ is the vector of scores on leaves, $q$ is a function assigning each data point to the corresponding leaf, and $T$ is the number of leaves.

What this expression is saying is that $q$ is a partitioning function of $R^d$, and $w$ is the weight associated with each partition. Partitioning $R^d$ can be done with coordinate-aligned splits, and coordinate-aligned splits are decision trees.
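
In code, this definition is just a table lookup: $q$ routes a point to a leaf index and $w$ stores one score per leaf. A toy sketch with a hypothetical two-split tree (the leaf scores are made up):

```python
import numpy as np

# Hypothetical partitioning function q: maps x in R^2 to a leaf index in {0, 1, 2}
# via coordinate-aligned splits, i.e. a small decision tree.
def q(x):
    if x[0] < 0.5:
        return 0
    return 1 if x[1] < 2.0 else 2

# w holds one score per leaf (T = 3 here); the tree's prediction is a lookup.
w = np.array([2.0, 0.1, -1.0])

def f(x):
    return w[q(x)]

print(f(np.array([0.2, 3.0])))  # falls in leaf 0 -> prediction 2.0
print(f(np.array([0.9, 1.0])))  # falls in leaf 1 -> prediction 0.1
```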

The meaning of $w$ is that it is a "weight" chosen so that the loss of the ensemble with the new tree is lower than the loss of the ensemble without the new tree. This is described in "The Structure Score" section of the documentation. The score for a leaf $j$ is given by

$$ w_j^* = -\frac{G_j}{H_j + \lambda} $$

where $G_j$ is the sum of the first derivatives and $H_j$ the sum of the second derivatives of the loss function with respect to the prediction of the ensemble through tree $t-1$, taken over the samples in the $j$th leaf. (See "Additive Training" for details.)
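
As a concrete (and hedged) worked example, take squared-error loss $\ell(y,\hat y)=\tfrac{1}{2}(y-\hat y)^2$ and suppose this is the first tree, fit on top of a base prediction of $0$ for every sample. Then $g_i=\hat y_i-y_i=-y_i$ and $h_i=1$, so $w_j^*=\frac{\sum_{i\in j} y_i}{n_j+\lambda}$: with $\lambda=0$ the leaf weight is exactly the leaf mean (the familiar CART rule), and $\lambda>0$ shrinks it toward zero. Applying this to the regression leaves in the question:

```python
# Optimal leaf weight w_j* = -G_j / (H_j + lambda) for squared-error loss,
# assuming the previous ensemble predicts 0 for every sample (g_i = -y_i, h_i = 1).
def leaf_weight(ys, lam):
    G = sum(-y for y in ys)  # sum of first derivatives g_i over the leaf
    H = len(ys)              # sum of second derivatives h_i (each equal to 1)
    return -G / (H + lam)

leaves = [(0.1, 0.2, 0.3, 100), (4, 4, 4), (50,)]
for lam in (0.0, 1.0):
    print(lam, [round(leaf_weight(ys, lam), 2) for ys in leaves])
# 0.0 [25.15, 4.0, 50.0]  -- with no regularization, each weight is the leaf mean
# 1.0 [20.12, 3.0, 25.0]  -- lambda shrinks the weights toward zero
```

This also speaks to the question's second concern: penalizing large $w$ does regularize the model, because every leaf's prediction is shrunk toward zero, much like a ridge penalty on regression coefficients.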
