Boosting – Impact of Feature Re-Splitting in Gradient Boosting Trees

boosting

If a feature has already been used for a split, is it unlikely to be selected again in subsequent trees of a Gradient Boosting Tree? The question is motivated by the observation that, among heavily correlated features in a single tree, usually only one of them is selected for splitting, since little uncertainty remains about the others after that split. In a Gradient Boosting Tree, does the residual play a role similar to this uncertainty?

I am currently looking at how heavily correlated features affect the feature importances produced by a Gradient Boosting Tree. My guess is that a Gradient Boosting Tree will attribute the importance to only one of the correlated features, just like LASSO.

Best Answer

If a feature has already been used for a split, there is a good chance that it will be selected again, either within the same tree or in subsequent trees. The main reason is the following: a tree is effectively a (recursive) partitioning of the sample space via piecewise-constant functions. Standard tree-construction methodologies do not preclude selecting the same feature multiple times, so if a particular feature is highly predictive it can be selected again. It should be noted that, especially when the trees are strongly regularised (e.g. by having a relatively shallow depth), subsequent trees are very likely to reuse a feature $x$ if it is highly predictive.
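To make this concrete, here is a minimal sketch (synthetic data, scikit-learn's `GradientBoostingRegressor` assumed; the specific features and parameters are purely illustrative) that counts how often each feature is chosen for a split across all trees in the ensemble. A highly predictive feature typically gets reused in almost every tree.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
# Feature 0 is highly predictive; the remaining features are pure noise.
y = 3 * X[:, 0] + 0.1 * rng.normal(size=1000)

gbt = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=0)
gbt.fit(X, y)

# gbt.estimators_ holds one regression tree per boosting stage;
# tree_.feature gives the split feature per node (negative values mark leaves).
split_counts = np.zeros(X.shape[1], dtype=int)
for stage in gbt.estimators_:
    tree = stage[0].tree_
    used = tree.feature[tree.feature >= 0]  # drop leaf markers
    for f in used:
        split_counts[f] += 1

# Feature 0 is typically split on in (almost) every tree.
print(split_counts)
```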

Regarding the presence of heavily correlated features: similar to the case of the LASSO, this correlation leads to instability during variable selection. Given a feature $x$ and a heavily correlated version $x_{\text{c}}$, there is no reason why $x$ and $x_{\text{c}}$ won't be used almost interchangeably, since they yield similar cost reductions when split on individually. This inability to adequately distinguish between correlated features is actually one of the reasons why some variable-importance metrics, such as the number of times a feature is used for splitting or the feature's permutation importance, suffer greatly in the presence of correlated features; in such instances individual feature importance is attenuated and usually "smeared" across the group of correlated features. This is also due to the fact that in quite a few instances we select only a subset of the available predictive features when constructing individual trees, so if $x$ is predictive but not selected while $x_{\text{c}}$ is, our learner will happily use $x_{\text{c}}$ instead of $x$.
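The "smearing" can be seen directly in the impurity-based importances. Below is a minimal sketch (again synthetic data with scikit-learn; the column subsampling via `max_features` and the coefficients are illustrative assumptions) where a near-duplicate feature $x_{\text{c}}$ dilutes the importance of $x$, while a weaker but uncorrelated feature $x_1$ keeps its share to itself.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
x_c = x + 0.01 * rng.normal(size=n)   # heavily correlated copy of x
x1 = rng.normal(size=n)               # weaker, uncorrelated feature
y = 3 * x + 1 * x1 + 0.5 * rng.normal(size=n)

X = np.column_stack([x, x_c, x1])
# subsample and max_features mimic the row/column subsampling mentioned above.
gbt = GradientBoostingRegressor(n_estimators=200, max_depth=2,
                                subsample=0.5, max_features=2,
                                random_state=1).fit(X, y)

# The importance that "belongs" to x tends to be split between x and x_c,
# whereas x1's importance is not shared with any sibling feature.
print(dict(zip(["x", "x_c", "x1"], gbt.feature_importances_.round(3))))
```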

So to recap: a given feature $x$ might be used multiple times within a tree and across trees. A highly correlated version of $x$, $x_{\text{c}}$, is as likely to be used as $x$ itself, thus making the individual importance of $x$ (and of $x_{\text{c}}$) downward biased. This bias is especially prominent relative to a third feature $x_1$ that might not be as predictive as $x$ but has no sibling features included in the training set.
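The same downward bias shows up with permutation importance. A minimal sketch (same hypothetical setup as above): permuting $x$ alone typically costs relatively little, because $x_{\text{c}}$ still carries nearly the same signal, while $x_1$ has no sibling feature to fall back on.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
x_c = x + 0.01 * rng.normal(size=n)   # heavily correlated copy of x
x1 = rng.normal(size=n)               # weaker, uncorrelated feature
y = 3 * x + 1 * x1 + 0.5 * rng.normal(size=n)
X = np.column_stack([x, x_c, x1])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)
gbt = GradientBoostingRegressor(random_state=2).fit(X_tr, y_tr)

result = permutation_importance(gbt, X_te, y_te, n_repeats=20, random_state=2)
# x's permutation importance is typically deflated by the presence of x_c;
# x1's importance is unaffected by any sibling feature.
print(dict(zip(["x", "x_c", "x1"], result.importances_mean.round(3))))
```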
