Solved – Boosted Regression Trees in R

Tags: r, regression

As a general statistics question: when using BRTs, should you avoid using strongly correlated variables the way you would in, say, GLMs or multiple linear regression? Any useful references would be greatly appreciated.

Best Answer

Generally, tree-based methods are quite robust to redundant features. In the worst case you increase computing time, but prediction-wise you will be quite safe. The problem with GLMs and the like is that redundant features can cause overfitting, since the number of parameters grows with the number of features.
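To make that contrast concrete, here is a minimal R sketch (the toy data and variable names are my own, not from the question): two nearly identical predictors destabilize the coefficients of `lm()`, while an `rpart` tree predicts about as well with or without the redundant copy.

```r
library(rpart)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # near-duplicate of x1
y  <- 2 * x1 + rnorm(n)

# Parametric fit: the two collinear coefficients are essentially arbitrary
# and carry huge standard errors -- the classic collinearity symptom.
summary(lm(y ~ x1 + x2))$coefficients

# Tree fit: whichever near-duplicate gets picked at each split, predictions
# are about as good as with x1 alone.
tree_both <- rpart(y ~ x1 + x2)
tree_one  <- rpart(y ~ x1)
mean((predict(tree_both) - y)^2)   # training MSE, both predictors
mean((predict(tree_one)  - y)^2)   # training MSE, x1 only
```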

Indeed, for a decision tree, if you duplicate a feature (the worst possible case), then once one copy is selected to make a split, the other copy will never be used below that split (at least not for the same split point), because it can no longer reduce impurity.
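A quick way to see this (again a sketch with made-up data, not code from the answer): fit an `rpart` tree with and without an exact duplicate column and compare the predictions, which come out identical.

```r
library(rpart)

set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- ifelse(x1 > 0, 3, 0) + x2 + rnorm(n)
d  <- data.frame(y, x1, x2, x1_dup = x1)   # x1_dup is an exact copy of x1

fit_base <- rpart(y ~ x1 + x2, data = d)
fit_dup  <- rpart(y ~ x1 + x2 + x1_dup, data = d)

# Identical partitions, hence identical predictions: once one copy has been
# used for a split, the other copy can never reduce impurity further.
all.equal(predict(fit_base, d), predict(fit_dup, d))   # TRUE
```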

Similarly for correlated features: if one feature is chosen to make a split, the other becomes less likely to be chosen, since there is less chance it will still reduce impurity.

So for gradient boosting you are in exactly the same situation. In fact, adding the extra feature is never really harmful: as the number of trees increases, and provided a reasonable shrinkage is used, both features will eventually be expressed as fully as possible. But they shouldn't "counteract" each other the way collinear predictors do in parametric models.
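For boosting itself, here is a hedged sketch using the gbm package (one common BRT implementation in R; the thread doesn't name a package, and the data are simulated for illustration). Two strongly correlated predictors end up sharing the relative influence, while cross-validated performance stays essentially the same.

```r
library(gbm)

set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # strongly correlated with x1
x3 <- rnorm(n)
y  <- sin(x1) + 0.5 * x3 + rnorm(n, sd = 0.3)
d  <- data.frame(y, x1, x2, x3)

# Small shrinkage + many trees, as the answer suggests.
fit <- gbm(y ~ x1 + x2 + x3, data = d,
           distribution = "gaussian",
           n.trees = 2000, shrinkage = 0.01,
           interaction.depth = 2, cv.folds = 5)

best <- gbm.perf(fit, method = "cv")   # optimal number of trees by CV
summary(fit, n.trees = best)           # x1 and x2 split the influence between them
```

Note how the relative influence is divided between x1 and x2 rather than one of them being penalized, which is the "they don't counteract" point above.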
