Machine Learning – Adjusted R^2 in Tree Ensembles for Model Evaluation

boosting, cart, machine-learning, model-evaluation, r-squared

Consider tree ensemble methods such as XGBoost, LightGBM, or CatBoost.

Is the adj. $R^2$ a valid metric for tree ensembles?

I'm curious because these methods handle factor (categorical) variables differently. For example, XGBoost requires some form of one-hot encoding, LightGBM groups the categories of a factor itself rather than relying on one-hot encoding, and CatBoost uses its own scheme called ordered target encoding. Without going into details, CatBoost's handling does not extend the feature space, since the categories are encoded within the original factor variable. So between XGBoost and CatBoost this always leads to a different number of predictors, which favors CatBoost in terms of the adjusted $R^2$ metric.
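For context, the dependence on the number of predictors $p$ is explicit in the usual definition of adjusted $R^2$ for a model fit on $n$ observations:

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},$$

so an encoding that inflates $p$ (one-hot) is penalized more heavily than one that leaves the column count unchanged (ordered target encoding), even if the fitted ensembles are otherwise comparable.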

Best Answer

No, adjusted $R^2$ is not a valid metric for tree ensembles.

In order to have an "adjusted" $R^2$ we need an agreed-upon concept of degrees of freedom. Tree ensembles don't have one, so the idea of "adjusting for model complexity" in a universal manner is ill-defined. On top of that, different GBM implementations use different tree-growing strategies (e.g. level-wise versus leaf-wise growth), which makes even a simple "tree-to-tree" comparison somewhat moot. We should instead use a proper cross-validation scheme with plain $R^2$ (if that is a metric relevant to the application at hand).
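As a sketch of what that cross-validated $R^2$ looks like in practice, the snippet below uses scikit-learn's `cross_val_score` with `scoring="r2"`. The `GradientBoostingRegressor` and the synthetic data are only stand-ins; the sklearn-compatible estimators from XGBoost, LightGBM, or CatBoost could be dropped in the same way.

```python
# Minimal sketch: estimate out-of-fold R^2 for a tree ensemble via cross-validation.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, purely for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

model = GradientBoostingRegressor(random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scoring="r2" computes the plain (unadjusted) R^2 on each held-out fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(f"CV R^2: {scores.mean():.3f} ± {scores.std():.3f}")
```

Because the score is computed on held-out folds, it already accounts for overfitting empirically, which is the role the degrees-of-freedom adjustment plays in the classical linear-model setting.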