Solved – AdaBoost – why decision stumps instead of trees

Tags: adaboost, bagging, boosting, cart

Since the original AdaBoost article, it has been found that boosting reduces both bias and variance in the classifier (in contrast to bagging, which only reduces variance). The original AdaBoost (and the default Scikit-learn implementation) uses decision stumps, i.e. decision trees of depth 1.
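
For reference, a minimal sketch of what the Scikit-learn default boils down to (assuming scikit-learn >= 1.2, where the base-learner keyword is `estimator`; older releases call it `base_estimator`; the synthetic data is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, just to have something to fit.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Spelling out explicitly what the default base learner already is:
# a depth-1 decision tree, i.e. a stump.
stump_boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    random_state=0,
)
stump_boost.fit(X, y)
```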

Why is that? Why shouldn't we just use full decision trees, or decision trees with some pruning? Sample weighting works for entire trees, after all, and more recent boosting algorithms (e.g. gradient boosting in Scikit-learn, XGBoost, LightGBM) grow deeper trees. This would also allow different tree-growing algorithms to be boosted in the same simple way, instead of only decision stumps. A sketch of boosting a deeper tree follows below.
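
Indeed, nothing in the API forbids a deeper base learner. A sketch of the same setup with depth-3 trees (same scikit-learn version assumption as above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Boosting depth-3 trees instead of stumps; the library accepts any
# classifier that supports sample weights as the base estimator.
deep_boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=200,
    random_state=0,
)
deep_boost.fit(X, y)
```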

Best Answer

The reason for using 'stumps' in boosting but full-height trees in random forests has to do with how the aggregation and fitting are done.

In random forests, the trees in the ensemble are fitted independently to independent bootstrap samples, so any error caused by growing the trees too far is independent for each tree and tends to cancel out in the ensemble average.

In boosting, the trees are fitted sequentially, with each one trained on (in some sense) the residuals from the previous classifier. Once a boosted ensemble starts overfitting, it will keep overfitting; the errors won't just cancel out.
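
One way to see the contrast in practice is to compare held-out accuracy for a bagged ensemble of unpruned trees, boosted stumps, and boosted deep trees. The sketch below (hypothetical synthetic data, scikit-learn >= 1.2) only sets up the comparison; the actual numbers will depend on the data and settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    # Bagging-style ensemble: deep, independently grown trees whose
    # individual overfitting tends to average out.
    "random forest (unpruned trees)": RandomForestClassifier(
        n_estimators=200, random_state=0),
    # Boosting shallow learners: errors are corrected sequentially.
    "AdaBoost (stumps)": AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=200, random_state=0),
    # Boosting deep learners: each stage can already overfit on its own.
    "AdaBoost (depth-8 trees)": AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=8),
        n_estimators=200, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}")
```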

For this reason, it's worth keeping the individual trees short when boosting and tall when bagging. It's not clear that 'stumps' are optimal for boosting; there are recommendations for trees with, say, 6 leaves so that interactions between features can be captured, but that's the explanation for the basic idea.
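
If you want to try the 6-leaf suggestion, the base learner can be constrained by leaf count rather than depth. A sketch (same scikit-learn version assumption as above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Limiting the base learner by leaf count rather than depth: 6 leaves lets
# each tree model some feature interactions while staying small.
six_leaf_boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_leaf_nodes=6),
    n_estimators=200,
    random_state=0,
)
six_leaf_boost.fit(X, y)
```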
