Bagging:
Take N random samples of x% of the samples and y% of the Features
Instances are repeatedly sub-sampled in bagging, but not features. (Random Forests, XGBoost and CatBoost do both):
Given dataset D of size N.
For m in n_models:
Create new dataset D_i of size N by sampling with replacement from D.
Train model on D_i (and then predict)
Combine predictions with equal weight
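The bagging loop above can be sketched in Python. The base learner here (a mean predictor) is a deliberately trivial stand-in of my own choosing, not any particular tree implementation — the point is only the bootstrap-then-average structure:

```python
import random

def bagging_fit(D, n_models, base_fit):
    """Bagging: train each model on a bootstrap sample of D
    (sampled with replacement, same size N as D)."""
    models = []
    for _ in range(n_models):
        D_i = random.choices(D, k=len(D))  # bootstrap sample of size N
        models.append(base_fit(D_i))
    return models

def bagging_predict(models, x):
    """Combine predictions with equal weight (plain average)."""
    return sum(m(x) for m in models) / len(models)

# Toy base learner (an assumption for illustration): predict the
# mean target of its training sample, ignoring the input.
def mean_learner(D_i):
    mean_y = sum(y for _, y in D_i) / len(D_i)
    return lambda x: mean_y

random.seed(0)
D = [(x, 2.0 * x) for x in range(10)]  # toy regression data
models = bagging_fit(D, n_models=25, base_fit=mean_learner)
print(bagging_predict(models, 3))
```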
Include an initialization step in your boosting pseudocode to get rid of redundancy:
Init data with equal weights (1/N).
For m in n_models:
Train model on weighted data (and then predict)
Update weights according to misclassification rate.
Renormalize weights
Combine confidence weighted predictions
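Instantiated as a minimal AdaBoost-style sketch with 1-D decision stumps (the names and the stump learner are my own illustration; a real implementation would also expose a learning rate, see below):

```python
import math

def stump_fit(X, y, w):
    """Weighted decision stump on 1-D inputs: pick the threshold and
    sign with the lowest weighted misclassification rate."""
    best = None
    for t in X:
        for sign in (1, -1):
            pred = [sign if x >= t else -sign for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return err, (lambda x: sign if x >= t else -sign)

def adaboost(X, y, n_models):
    n = len(X)
    w = [1.0 / n] * n                        # init: equal weights 1/N
    ensemble = []
    for _ in range(n_models):
        err, h = stump_fit(X, y, w)          # train on weighted data
        err = max(err, 1e-10)                # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # confidence weight
        # update weights according to misclassification, then renormalize
        w = [wi * math.exp(-alpha * yi * h(xi))
             for wi, xi, yi in zip(w, X, y)]
        Z = sum(w)
        w = [wi / Z for wi in w]
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, x):
    """Combine confidence-weighted predictions."""
    s = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if s >= 0 else -1

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [-1, -1, -1, 1, 1, 1]
ens = adaboost(X, y, n_models=5)
print([predict(ens, x) for x in X])
```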
"Bagged Boosted Trees" (as you call it) is certainly a reasonable approach, but it is different from XGBoost and CatBoost:
Given dataset D of size N.
For m in n_models:
Create new dataset D_i of size N by sampling with replacement from D.
(Insert boosting pseudocode here, applied to D_i)
Combine predictions with equal weight
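Putting the two together, a "Bagged Boosted Trees" sketch could look like the following. The inner boost routine is a deliberately tiny fit-to-residuals stand-in (constants instead of trees, fixed learning rate of my own choosing) — it only shows the structure of boosting inside an outer bagging loop:

```python
import random

def boost(D_i, n_rounds, lr=0.5):
    """Tiny boosting routine (fit-to-residuals flavour): each round
    fits a constant to the current residuals -- a stand-in for a tree."""
    y = [yi for _, yi in D_i]
    pred = [0.0] * len(D_i)
    total = 0.0
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        step = lr * sum(residuals) / len(residuals)
        pred = [pi + step for pi in pred]
        total += step
    return total                    # constant model: overall offset

def bagged_boosting(D, n_models, n_rounds):
    models = []
    for _ in range(n_models):
        D_i = random.choices(D, k=len(D))    # bootstrap sample of size N
        models.append(boost(D_i, n_rounds))  # boosting inside the bag
    return models

random.seed(1)
D = [(x, 5.0) for x in range(20)]   # toy data with constant target 5
models = bagged_boosting(D, n_models=10, n_rounds=20)
print(sum(models) / len(models))    # equal-weight combination
```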
XGBoost and CatBoost are both based on boosting and use the entire training data. They additionally implement bagging by drawing a subsample once in every boosting iteration:
Init data with equal weights (1/N).
For m in n_models:
Train model on weighted bootstrap sample (and then predict)
Update weights according to misclassification rate.
Renormalize weights
Combine confidence weighted predictions
If you want to stick to "fit model to residuals", then this would be equivalent to "fit model to residuals of data in bootstrap sample".
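A sketch of that idea — boosting over the full training data while each round's weak learner is fitted only to the residuals of a fresh random subsample (again with a constant as a stand-in for the tree, and a subsample fraction and learning rate of my own choosing):

```python
import random

def boosting_with_subsampling(D, n_rounds, subsample=0.5, lr=0.3):
    """Boosting over the full data, but each round the weak learner is
    fitted only to the residuals of a fresh random subsample -- the
    'fit model to residuals of data in bootstrap sample' idea."""
    y = [yi for _, yi in D]
    pred = [0.0] * len(D)
    total = 0.0
    k = max(1, int(subsample * len(D)))
    for _ in range(n_rounds):
        idx = random.sample(range(len(D)), k)  # fresh subsample each round
        residuals = [y[i] - pred[i] for i in idx]
        step = lr * sum(residuals) / len(residuals)  # constant weak learner
        total += step
        pred = [pi + step for pi in pred]      # predictions updated on ALL data
    return total

random.seed(2)
D = [(x, 3.0) for x in range(40)]   # constant toy target
model = boosting_with_subsampling(D, n_rounds=50)
print(model)
```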
Further Remarks:
There is no "best way to do it" as you suggest (no free lunch theorem). "Bagged Boosted Trees" might outperform XGBoost on certain data sets.
Take a single random sample of x% of the samples
This line is confusing. Where did you get this from?
if i mod bag_frequency == 0 (i.e., bag every 5 rounds):
This should not be mentioned in your pseudocode. Especially when there are other more important parameters left out (like the learning rate in boosting).
Best Answer
The reason for using 'stumps' in boosting but full-height trees in random forests is to do with how the aggregation and fitting is done.
In random forests, the trees in the ensemble are fitted independently to independent bootstrap samples, so any error caused by growing the trees too far is independent for each tree and tends to cancel out in the ensemble average.
In boosting, the trees are fitted sequentially, with each one trained on (in some sense) the residuals from the previous classifier. Once a boosted ensemble starts overfitting, it will keep overfitting; the errors won't just cancel out.
For this reason, it's worth having individual trees be short when boosting and tall when bagging. It's not clear that 'stumps' are optimal for boosting -- there are recommendations for trees with, say, 6 leaves to include interactions better -- but that's an explanation for the basic idea.
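The "errors cancel out in the ensemble average" point can be illustrated with a toy simulation (my own construction, not any real forest): unbiased but high-variance predictions, standing in for overgrown trees, have a much smaller spread once averaged over an ensemble:

```python
import random
import statistics

random.seed(0)
truth = 10.0

def noisy_prediction():
    # stand-in for one overgrown tree: unbiased but high-variance
    return truth + random.gauss(0, 2.0)

# spread of single trees vs. equal-weight averages of 100 trees
single = [noisy_prediction() for _ in range(1000)]
ensembles = [statistics.mean(noisy_prediction() for _ in range(100))
             for _ in range(1000)]

print(statistics.stdev(single))     # close to the per-tree sigma of 2.0
print(statistics.stdev(ensembles))  # roughly 2.0 / sqrt(100)
```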