CART – Definition of Complexity of a Tree in XGBoost

Tags: boosting, cart, gradient descent, overfitting, regularization

While researching the XGBoost algorithm, I went through its documentation.

In this approach trees are regularized using the complexity definition
$$
\Omega(f) = \gamma T + \frac12 \lambda \sum_{j=1}^T w_j^2
$$
where $\gamma$ and $\lambda$ are parameters, $T$ is the number of leaves (terminal nodes), and $w_j$ is the score of leaf $j$.
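
For example (values chosen just for illustration), a tree with $T = 3$ leaves, scores $w = (0.5, -1, 2)$, and $\gamma = \lambda = 1$ would have
$$
\Omega(f) = 1 \cdot 3 + \frac12 \cdot 1 \cdot (0.25 + 1 + 4) = 5.625.
$$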

I wonder: how does this define complexity? $T$, the number of terminal nodes, seems natural to me, but why the sum of squared leaf scores?

Maybe it is meant to prevent overfitting, in the sense that very large scores give too much confidence? Or is it chosen to keep each tree a weak learner? What is a natural explanation for this choice of complexity function?

Best Answer

This makes sense to me.

I'll focus on the Gaussian case. Here each tree $T_i$ is fit on the residuals of the current model, and the model update is $M_{i+1} = M_{i} + \alpha T_i$. The idea of a gradient booster is to carefully and slowly reduce the bias of the model by adding these trees one by one.
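
As a sketch of that update loop for squared loss, here is a minimal implementation using scikit-learn trees as the base learners (the toy data and hyperparameters are my own illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noisy sine wave (illustrative assumption, not from the post).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

alpha = 0.1                    # learning rate, the alpha in M_{i+1} = M_i + alpha * T_i
n_trees = 100
prediction = np.zeros_like(y)  # M_0: start from the zero model

for i in range(n_trees):
    residuals = y - prediction                          # what the next tree is fit on
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += alpha * tree.predict(X)               # M_{i+1} = M_i + alpha * T_i

print("training MSE:", np.mean((y - prediction) ** 2))
```

Each tree only nudges the model by a small step, which is the "carefully and slowly" part; the bias shrinks as trees accumulate.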

In this case, a large value of $w_j$ would correspond to a terminal (leaf) node giving a very large and significant update to the prior model. The idea of the regularization term is to discourage these large single-tree updates, only allowing them if the decrease in the model's loss function is large enough to offset the regularization penalty. If such an update is regularized away for a single tree but turns out to be justified, it will be baked in over multiple model updates, in accordance with the philosophy of boosting.
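
To make the shrinkage explicit, here is a quick derivation for the squared error case (a standard computation, not spelled out in the answer above): for a single leaf containing residuals $r_1, \dots, r_n$, choosing the leaf weight $w$ amounts to minimizing
$$
\sum_{i=1}^{n} \frac12 (r_i - w)^2 + \frac12 \lambda w^2,
\qquad\text{which gives}\qquad
w^* = \frac{\sum_{i=1}^{n} r_i}{n + \lambda}.
$$
With $\lambda = 0$ the leaf takes the mean residual; with $\lambda > 0$ the update is shrunk toward zero, which is exactly the damping of large single-tree updates described above.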

This is in very close analogy to ridge regression.
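
As a quick sanity check of the analogy, one can fit a single XGBoost tree with increasing `reg_lambda` (the $\lambda$ above; `gamma` is $\gamma$) and watch the leaf values shrink. A minimal sketch, assuming the xgboost Python package and the same kind of toy data as before:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Fit a single shallow tree with small and large lambda.
for lam in (0.0, 100.0):
    model = xgb.XGBRegressor(n_estimators=1, max_depth=2,
                             learning_rate=1.0, reg_lambda=lam, gamma=0.0)
    model.fit(X, y)
    print(f"lambda={lam}:")
    # The 'leaf=' values in the dump shrink toward zero as lambda grows.
    print(model.get_booster().get_dump()[0])
```

The leaf weights get pulled toward zero just as ridge regression pulls coefficients toward zero.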
