Solved – Tree size in gradient tree boosting

Tags: boosting, cart, r

Gradient tree boosting as proposed by Friedman uses decision trees with J terminal nodes (i.e. leaves) as base learners. There are several ways to grow a tree with exactly J terminal nodes: for example, one can grow the tree in a depth-first fashion or in a breadth-first fashion.

Is there an established way to grow trees with exactly J terminal nodes for gradient tree boosting?

I examined the tree-growing procedure of R's gbm package, and it seems that it expands the tree in a depth-first fashion and uses a heuristic based on error improvement to choose whether to expand the left or the right child node — is that correct?

Best Answer

The solution in R's gbm is not a typical one.

Other packages, such as scikit-learn and LightGBM, use what scikit-learn calls a BestFirstTreeBuilder when the number of leaves is restricted. It maintains a priority queue of all current leaves and at each iteration splits the leaf whose best split yields the largest impurity decrease. So it is neither depth-first nor breadth-first, but a third strategy, driven by computations at the leaves.
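The idea can be sketched with a toy best-first grower for a regression tree. This is my own minimal illustration, not scikit-learn's or LightGBM's actual implementation: it keeps a max-heap of leaves keyed by the SSE decrease of each leaf's best split, and repeatedly splits the leaf at the top of the heap until the leaf budget is reached.

```python
import heapq

def sse(values):
    # Sum of squared errors around the mean of `values`.
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, y, idx):
    """Best single-feature threshold split of the samples in `idx`,
    measured by decrease in SSE. Returns (gain, feature, threshold)."""
    parent = sse([y[i] for i in idx])
    best = (0.0, None, None)
    for j in range(len(X[0])):
        for t in sorted({X[i][j] for i in idx})[:-1]:
            left = [y[i] for i in idx if X[i][j] <= t]
            right = [y[i] for i in idx if X[i][j] > t]
            gain = parent - sse(left) - sse(right)
            if gain > best[0]:
                best = (gain, j, t)
    return best

def best_first_grow(X, y, max_leaves):
    """Grow a tree best-first: always split the leaf whose best split
    gives the largest immediate impurity (SSE) decrease.
    Returns the final leaves as lists of sample indices."""
    root = list(range(len(y)))
    gain, j, t = best_split(X, y, root)
    heap = [(-gain, 0, root, j, t)]  # max-heap via negated gain
    done = []                        # leaves with no useful split left
    counter, n_leaves = 1, 1
    while heap and n_leaves < max_leaves:
        neg_gain, _, idx, j, t = heapq.heappop(heap)
        if j is None or -neg_gain <= 0:
            done.append(idx)         # nothing gained by splitting; freeze it
            continue
        left = [i for i in idx if X[i][j] <= t]
        right = [i for i in idx if X[i][j] > t]
        for child in (left, right):
            g, cj, ct = best_split(X, y, child)
            heapq.heappush(heap, (-g, counter, child, cj, ct))
            counter += 1
        n_leaves += 1                # one leaf became two
    return done + [entry[2] for entry in heap]
```

In scikit-learn this strategy kicks in when you pass `max_leaf_nodes` to a tree-based estimator; LightGBM's analogous parameter is `num_leaves` (it grows leaf-wise by default).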

In some sense, this approach is better than blindly splitting all the leaves in turn. However, it is still a greedy heuristic: the decision whether to split the $i$'th leaf now depends only on the best immediate split of that leaf, not on the possible successive splits, which may decrease impurity much more than the current one.
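A classic way to see the greediness is XOR-style data, where no single split reduces the impurity at all, even though two successive splits would fit the data perfectly. This toy check (my own construction) uses an SSE-based gain:

```python
def sse(values):
    # Sum of squared errors around the mean of `values`.
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_gain(X, y, idx):
    """Largest immediate SSE decrease over all single-feature
    threshold splits of the samples in `idx`."""
    parent = sse([y[i] for i in idx])
    best = 0.0
    for j in range(len(X[0])):
        for t in sorted({X[i][j] for i in idx})[:-1]:
            left = [y[i] for i in idx if X[i][j] <= t]
            right = [y[i] for i in idx if X[i][j] > t]
            best = max(best, parent - sse(left) - sse(right))
    return best

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0.0, 1.0, 1.0, 0.0]   # y = x1 XOR x2
root = [0, 1, 2, 3]

# A greedy criterion sees zero gain for every possible first split:
print(best_gain(X, y, root))               # 0.0
# Yet after splitting on feature 0 anyway, each half is perfectly
# separable, so two successive splits would remove all the error:
print(best_gain(X, y, [0, 1]))             # 0.5
print(best_gain(X, y, [2, 3]))             # 0.5
```

So a greedy builder (best-first or otherwise) may refuse, or wrongly rank, a split whose real value only shows up one level deeper.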