Machine Learning – Implications of Scaling Features for XGBoost

boosting, machine-learning, regularization, ridge-regression

While researching the xgboost algorithm, I went through its documentation.

I have heard that xgboost does not care much about the scale of the input features.

In this approach, trees are regularized using the complexity definition
$$
\Omega(f) = \gamma T + \frac12 \lambda \sum_{j=1}^T w_j^2
$$
where $\gamma$ and $\lambda$ are parameters, $T$ is the number of terminal leaves and $w_j$ is the score in each leaf.
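
For instance (made-up numbers), a tree with $T = 3$ leaves and scores $w = (2, -1, 0.5)$ gives
$$
\Omega(f) = 3\gamma + \tfrac{1}{2}\lambda\left(2^2 + (-1)^2 + 0.5^2\right) = 3\gamma + 2.625\,\lambda,
$$
so larger leaf scores are penalized more heavily.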

Does this not make it important to scale the features before feeding them into xgboost? The $\sum_{j=1}^T w_j^2$ term in the regularization part of the cost function would be directly influenced by the scale of the features.

Best Answer

XGBoost is not sensitive to monotonic transformations of its features, for the same reason that decision trees and random forests are not: the model only needs to pick "cut points" on features to split a node. Splits are not sensitive to monotonic transformations: a split defined on one scale has a corresponding split on the transformed scale.
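
You can check this empirically by fitting the same model on raw and on monotonically transformed features and comparing predictions. Below is a minimal sketch (not part of the original answer), assuming the `xgboost` Python package; with the exact tree method the two models should pick corresponding splits and return essentially identical predictions.

```python
# Sketch: an XGBoost model trained on raw features vs. one trained on
# monotonically transformed (log) features. Data and parameters are made up.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(500, 3))                # strictly positive features
y = np.sin(X[:, 0] / 10) + 0.1 * rng.normal(size=500)

params = dict(n_estimators=50, max_depth=3, tree_method="exact", random_state=0)

model_raw = xgb.XGBRegressor(**params).fit(X, y)
model_log = xgb.XGBRegressor(**params).fit(np.log(X), y)   # monotonic transform

# Splits on log(X) correspond one-to-one to splits on X, so the two models
# should make essentially the same predictions, up to floating-point effects.
print(np.max(np.abs(model_raw.predict(X) - model_log.predict(np.log(X)))))
```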

Your confusion stems from misunderstanding $w$. In the section "Model Complexity," the author writes

Here $w$ is the vector of scores on leaves...

The score measures the weight of the leaf. See the diagram in the "Tree Ensemble" section; the author labels the number below the leaf as the "score."

[score diagram from the "Tree Ensemble" section of the documentation]

The score is also defined more precisely in the paragraph preceding your expression for $\Omega(f)$:

We need to define the complexity of the tree $\Omega(f)$. In order to do so, let us first refine the definition of the tree $f(x)$ as $$f_t(x)=w_{q(x)}, \quad w \in R^T, \quad q:R^d \to \{1,2,\dots,T\}.$$ Here $w$ is the vector of scores on leaves, $q$ is a function assigning each data point to the corresponding leaf, and $T$ is the number of leaves.

What this expression is saying is that $q$ is a partitioning function of $R^d$, and $w$ is the weight associated with each partition. Partitioning $R^d$ can be done with coordinate-aligned splits, and coordinate-aligned splits are decision trees.
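
As a concrete toy illustration of that decomposition (cut points and scores are arbitrary, not from the documentation), the sketch below writes a one-feature tree as a leaf-assignment function `q` plus a weight vector `w`:

```python
# Sketch of f_t(x) = w_{q(x)}: a partitioning function q maps a point to a
# leaf index, and w holds the score of each leaf.
import numpy as np

cut_points = [2.0, 5.0]          # coordinate-aligned splits on feature 0
w = np.array([-0.4, 0.1, 0.7])   # one score per leaf, T = 3

def q(x):
    """Assign a data point to its leaf index based on the cut points."""
    if x[0] < cut_points[0]:
        return 0
    elif x[0] < cut_points[1]:
        return 1
    return 2

def f_t(x):
    """Tree prediction: look up the weight of the leaf the point falls into."""
    return w[q(x)]

print(f_t(np.array([1.0])), f_t(np.array([3.0])), f_t(np.array([9.0])))
# -0.4 0.1 0.7
```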

The meaning of $w$ is that it is a "weight" chosen so that the loss of the ensemble with the new tree is lower than the loss of the ensemble without the new tree. This is described in "The Structure Score" section of the documentation. The score for a leaf $j$ is given by

$$ w_j^* = -\frac{G_j}{H_j + \lambda} $$

where $G_j$ and $H_j$ are, respectively, the sums of the first- and second-order partial derivatives of the loss function with respect to the prediction from iteration $t-1$, taken over the samples in the $j$th leaf. (See "Additive Training" for details.)
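
To make that formula concrete, here is a small sketch (my own illustration, not from the documentation) for squared-error loss $l(y, \hat y) = \tfrac12 (y - \hat y)^2$, where $g_i = \hat y_i - y_i$ and $h_i = 1$; the residuals and $\lambda$ below are made up.

```python
# Optimal leaf weight under squared-error loss: w_j* = -G_j / (H_j + lambda),
# which works out to the mean residual in the leaf, shrunk toward 0 by lambda.
import numpy as np

residuals = np.array([0.8, 1.2, 0.5, 1.0])   # y_i - yhat_i for samples in leaf j
lam = 1.0                                     # the lambda regularization parameter

G_j = np.sum(-residuals)   # sum of first-order gradients g_i = yhat_i - y_i
H_j = len(residuals)       # sum of second-order gradients, h_i = 1 for each sample

w_star = -G_j / (H_j + lam)
print(w_star)              # 0.7: the mean residual 0.875, shrunk toward 0 by lambda
```

Note that $w_j^*$ depends on the gradients of the loss, not directly on the raw feature values, which is why rescaling the features does not change the leaf scores that enter $\Omega(f)$.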