Machine Learning – Can the Same Split Occur in Subsequent Trees in a Gradient Boosted Trees Model?

Tags: boosting, cart, ensemble learning, gradient descent, machine learning

I am aware that there are several questions about feature/split selection in gradient-boosted trees (e.g. "If a feature has already split, will it hardly be selected to split again in the subsequent tree in a Gradient Boosting Tree?"), but in my opinion they do not address this concrete problem:

While investigating the feature importance of a gradient-boosted tree, I noticed that the same split seems to be performed in different subsequent trees. I am aware that, in general, a feature can occur in multiple splits across the whole model (or even within a single tree), but here it is not only the same feature but the same split criterion.

The gradient-boosted model was trained with the gbm package (wrapped by mlr) in R with the following properties (the best model after cross-validated hyperparameter tuning):

gbm::gbm(formula = f, distribution = "bernoulli", data = d, n.trees = 549L, 
    interaction.depth = 1L, n.minobsinnode = 45L, shrinkage = 0.2, 
    keep.data = TRUE)
A gradient-boosted model with bernoulli loss function.
549 iterations were performed.
There were 60 predictors of which 31 had non-zero influence. 

So it is an ensemble of 549 decision stumps.

The data consists of 60 dummy-coded / one-hot-encoded predictors (binary inputs) and a binary target variable (binary classification).

Y X1 X2
A 0 1
A 1 1
B 0 0
B 0 0
B 0 1

So each input feature can only be 0 or 1. Therefore, for each feature there is only one possible split point, 0.5 (the feature is present or not), and only 60 different splits (one per input feature) are available to the model during training.

When investigating the individual stumps, I noticed that the same split is performed multiple times. For example, the first two stumps use the same feature (16) with the same split:

> pretty.gbm.tree(model, i.tree = 1)
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight   Prediction
0       16   0.500000000        1         2           3       11.43972    417  0.008633094
1       -1   0.083544304       -1        -1          -1        0.00000    316  0.083544304
2       -1  -0.225742574       -1        -1          -1        0.00000    101 -0.225742574
3       -1   0.008633094       -1        -1          -1        0.00000    417  0.008633094

> pretty.gbm.tree(model, i.tree = 2)
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight   Prediction
0       16   0.500000000        1         2           3        4.44811    417 -0.005348869
1       -1   0.041234353       -1        -1          -1        0.00000    318  0.041234353
2       -1  -0.154979823       -1        -1          -1        0.00000     99 -0.154979823
3       -1  -0.005348869       -1        -1          -1        0.00000    417 -0.005348869
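
As a quick sanity check (a sketch, assuming the fitted object is stored as model and that the internal tree layout follows the pretty.gbm.tree columns shown above), one can confirm that every stump's root split point is indeed 0.5, as expected for dummy-coded inputs:

# SplitCodePred of the root node of each stump; for numeric 0/1 predictors
# this is the split threshold, so it should come out as 0.5 for all 549 trees
unique(unlist(lapply(model$trees, function(t) t[[2]][[1]])))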

Looking at the feature used in each of the 549 stumps makes clear that, of course, a lot of features are used repeatedly, which in this scenario means the same splits are repeated multiple times.

# split variable (0-based index) of the root node of each stump
unlist(lapply(model$trees, function(...) (..1[[1]][[1]])))
  [1] 16 16  0 16 16 23 16 24 56  0 28 11 51 32  0 37 24 19 37 16 27 56 24 10 55 15  2 16  0 40 11 31 24 21 56
 [36] 23 15  0 41 16 23  2 11 19 22 10 56 16 24 40 40 38 28  0 20 48 15 56 48 55 32 28 19 21  2 24 38 56 20 10
 [71] 28 20 51 18 22 32 22 23 24  2 18  2 23 17 23 18  1 11 16 22 21 17 24 24  2 16 17 40 48 51 56  1 22 38 40
[106] 19 41 56 48  2 10 29  2 19  2 27 48 29 22 20 46 46 24  0 23 20 31 41 23 15 48 19 32 27 18 22 21 19 19 11
[141] 41 41 24  0 55 29 23 27 16  1 20 48 20 19 40 16 55  1 41 48 16 41  0 15 37  0 11 16 40 15 10 10 27 13 29
[176] 12 55 37 15  1 28 15 15 37 38 18 19 38 29 11 16 16 27  1  1 31 32 41 24 18 55 19 27 16  2 41 21 22 56 10
[211] 41 18 23 41 10 22 56 56 27 27 29 11 15 38 41 37 23 20 29 10 15  0 40 29 37 11 37 56  2 24 23 28 38 19 38
[246]  1 29 40 48 21 24 20 11 16 10  2 40 23 55 11 32  2 27 28 12  1 27 28 56 18 15 22 12 11 20 55 29  2 56 17
[281] 48 20 41  0 16  2 22 17 21 16 29 41  1 19 18 31 46 20  1 31 51 23 17 23 11 17 18  1 32 23 48 15 21 41 15
[316] 56 16 38 27 24 21 13 46  2 48 19 27 55 48  0 48 48 16 17 56 16 20 48  2 17 23 21 10 55 20 41 28 55 55 37
[351] 21  1 28 55 55 55 56 40 46 28 56 48 15 15 12 37 15 21 48 18 22 40 18 16 55 23 21 40 19 10 40 23 10 10  0
[386] 28 48 10 24  0 48 18 20 31 46 11 10 12 21 17 41 31 24 29 22 15 11 21 55 21 55 31 48 28 11 56 31 55 12 28
[421] 16 48 40  2 24 29 15  2 16 17 48 21 29  1 27  0 56  0 40 23 28  2 17 40 10 56 41 23 56 56 40 17  0 37 24
[456] 20 11 29 23 12 28 23 10 46  2 24 20 10 31 28 41 22 41 41 20 37 40 48 31 16 20 37 22 56 24 55 31 37 22 18
[491] 22 16 55 28 41 28 13 55  0 21 37  0 46 21 31 28 38  2 20 29  2 41 40 22 21 21 11 11 12 28 28 24 15 11  1
[526] 10 18  0 15 11 20 15 48 22 24 19  2 29 29 22 21 41 56 28 40 22 21 11 18
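
A quick way to quantify this reuse (again a sketch, on the assumption that model is the fitted gbm object from above) is to tabulate the root split variables:

# 0-based index of the split variable used at each stump's root node
split_vars <- unlist(lapply(model$trees, function(t) t[[1]][[1]]))
sort(table(split_vars), decreasing = TRUE)   # how often each split is reused
length(unique(split_vars))                   # 31 distinct splits, matching the
                                             # "31 had non-zero influence" above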

My understanding of gradient boosting machines is that additional models are fitted iteratively based on the residuals of the previous model; in this way, gradient descent is performed to minimise the loss. Therefore I do not understand how additively combining stumps with the same split, as apparently done here, can improve the model.

Am I misinterpreting the outputs? If not, how can the same split improve performance further?

My intuition is that once a split has been made, all available information has been "squeezed" out of it. Therefore I do not even understand how it makes sense to combine 500+ stumps when only 60 different splits are possible overall.

Best Answer

Yes, gradient boosted decision trees can make the same splits in multiple trees (or even have several identical trees). That may or may not be a good thing.

Let's say there is a single split that corresponds to the true underlying data-generating process. Then, with a low learning rate, the best thing XGBoost can do (assuming it can "figure that out" despite any noise in the data) is to make all trees identical, each containing just that one split. In this example a low learning rate is not really needed (with a high enough learning rate you could just have a single tree with the one split you need), but if you follow the usual recommendation of fixing the learning rate to a low value, you can still get a just-as-good model by having lots of identical trees.
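
To make this concrete, here is a minimal, self-contained sketch (simulated data, not the question's data set, and gbm rather than XGBoost) in which a single binary predictor drives the outcome: every stump is forced to make the identical split, yet the training deviance keeps improving for many iterations because each stump only contributes a small, shrunken step.

library(gbm)

set.seed(1)
n  <- 1000
x1 <- rbinom(n, 1, 0.5)                         # one informative binary feature
y  <- rbinom(n, 1, ifelse(x1 == 1, 0.8, 0.2))   # target depends only on x1
d  <- data.frame(y = y, x1 = x1)

fit <- gbm(y ~ x1, data = d, distribution = "bernoulli",
           n.trees = 100, interaction.depth = 1,
           shrinkage = 0.05, bag.fraction = 1)

# root split variable of every stump: 0 refers to x1, so all splits are identical
table(sapply(fit$trees, function(t) t[[1]][[1]]))

# training deviance still decreases over many of these identical stumps
round(fit$train.error[c(1, 10, 50, 100)], 4)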

That sort of answers your question of

My understanding of gradient boosting machines is, briefly, that additional models are fitted iteratively based on the residuals of the previous model.

The answer to that is basically about the learning rate. The final prediction of an XGBoost ensemble of multiple trees is a weighted sum of the predictions of all the trees, and the lower the learning rate, the less weight each tree has. So, if you have only one tree at the start and that single tree gets everything perfectly right, this still does not count as getting things perfect as far as the residuals are concerned; but if you have many identical trees, that does eventually count as getting things (almost) perfectly correct.
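
For intuition about the rate at which those identical trees add up, here is a back-of-the-envelope sketch. It uses squared-error boosting rather than the Bernoulli deviance (the exact numbers differ for Bernoulli, but the damping effect of the shrinkage is the same in spirit): refitting the same stump on the residuals closes a fraction nu (the shrinkage) of the remaining gap each time, so after m identical stumps the ensemble has covered 1 - (1 - nu)^m of the leaf's full effect.

nu <- 0.2                          # shrinkage used in the question's model
m  <- c(1, 2, 5, 10, 20, 50)       # number of identical stumps
round(1 - (1 - nu)^m, 4)           # fraction of the full leaf effect reached
# 0.2000 0.3600 0.6723 0.8926 0.9885 1.0000

With shrinkage 0.2 it therefore takes on the order of 20-30 identical stumps before a given split has contributed essentially all of its effect, which is why repeating the same split across many trees is not wasteful.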