Solved – Boosting AND Bagging Trees (XGBoost, LightGBM)

Tags: bagging, boosting, cart

There are many blog posts, YouTube videos, etc. about the ideas of bagging or boosting trees. My general understanding is that the pseudocode for each is:

Bagging:

  1. Take N random samples of x% of the samples and y% of the features
  2. Fit your model (e.g., a decision tree) on each of the N samples
  3. Predict with each of the N models
  4. Average the predictions to get the final prediction (see the sketch below)
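A minimal sketch of this bagging recipe, assuming a regression problem with scikit-learn decision trees as the base model; the names N_MODELS, ROW_FRAC, FEATURE_FRAC and the helper functions are made up for illustration and are not any library's API:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
N_MODELS, ROW_FRAC, FEATURE_FRAC = 50, 0.8, 0.6   # hypothetical settings

def bagging_fit(X, y):
    """Steps 1-2: draw N row/feature samples and fit one tree per sample."""
    models = []
    n_rows, n_cols = X.shape
    for _ in range(N_MODELS):
        rows = rng.choice(n_rows, size=int(ROW_FRAC * n_rows), replace=True)
        cols = rng.choice(n_cols, size=int(FEATURE_FRAC * n_cols), replace=False)
        tree = DecisionTreeRegressor().fit(X[np.ix_(rows, cols)], y[rows])
        models.append((tree, cols))
    return models

def bagging_predict(models, X):
    """Steps 3-4: predict with each of the N trees and average."""
    return np.mean([tree.predict(X[:, cols]) for tree, cols in models], axis=0)
```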

Boosting:

  1. Fit your model (e.g., decision tree) to your data
  2. Get the residuals
  3. Fit your model to the residuals
  4. Go to 2 for N boosting rounds
  5. The final prediction is a weighted sum of the sequential predictors (see the sketch below).
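Likewise, a minimal sketch of boosting on residuals (least-squares gradient boosting), again with scikit-learn trees as the weak learner; the function names are illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_fit(X, y, n_rounds=100, learning_rate=0.1):
    """Each round fits a shallow tree to the current residuals (steps 2-4)."""
    base = y.mean()                       # initial prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                                          # step 2
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)   # step 3
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def boosting_predict(base, trees, X, learning_rate=0.1):
    """Step 5: weighted (by the learning rate) sum of the sequential trees."""
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```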

I'll take any clarifications to my understanding above, but my intended question is as follows:

Both XGBoost and LightGBM have parameters that allow for bagging. The application is not Bagging OR Boosting (which is what every blog post covers), but Bagging AND Boosting. What is the pseudocode for where and when the combined bagging and boosting takes place?
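For concreteness, these are the commonly documented subsampling knobs in both libraries (values below are illustrative, not recommendations, and exact behaviour and defaults can vary by version):

```python
import numpy as np
import xgboost as xgb
import lightgbm as lgb

X, y = np.random.rand(200, 10), np.random.rand(200)   # toy data

# XGBoost (scikit-learn API): rows are re-sampled every boosting round,
# features are re-sampled for every tree.
xgb_model = xgb.XGBRegressor(
    n_estimators=100,
    subsample=0.8,          # fraction of rows used per boosting round
    colsample_bytree=0.8,   # fraction of features used per tree
).fit(X, y)

# LightGBM (native API): bagging_freq controls how often the row sample is
# re-drawn; bagging_fraction only takes effect when bagging_freq > 0.
booster = lgb.train(
    {
        "objective": "regression",
        "bagging_fraction": 0.8,   # fraction of rows sampled
        "bagging_freq": 5,         # re-draw the row sample every 5 rounds
        "feature_fraction": 0.8,   # fraction of features sampled per tree
    },
    lgb.Dataset(X, label=y),
    num_boost_round=100,
)
```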

I expected it to be "Bagged Boosted Trees", but it seems it is "Boosted Bagged Trees". The difference seems substantial.

Bagged Boosted Trees:

  1. Take N random samples of x% of the samples and y% of the features
  2. Fit a boosted-tree model on each of the N samples
  3. Predict with each of the N models
  4. Average the predictions to get the final prediction (see the sketch below)

This seems like the best way to do it. After all, the risk in boosting is overfitting and the primary benefit of bagging is to reduce overfitting; bagging a bunch of boosted models seems like a great idea.
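One way to realize this shape off the shelf is to wrap a boosted model in a bagging meta-estimator. A sketch using scikit-learn (an assumption about tooling on my part, not what XGBoost or LightGBM do internally): each bag trains its own boosted model, and the bags' predictions are averaged.

```python
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor

# Note: the first argument is named base_estimator in scikit-learn < 1.2.
bagged_boosted = BaggingRegressor(
    estimator=GradientBoostingRegressor(n_estimators=100),  # one boosted model per bag
    n_estimators=10,       # number of bags (N above)
    max_samples=0.8,       # x% of the samples per bag
    max_features=0.6,      # y% of the features per bag
)
# bagged_boosted.fit(X_train, y_train)
# bagged_boosted.predict(X_test)     # averages the bags' predictions
```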

However, from looking through, for example, scikit-learn's gradient_boosting.py (which does sample bagging, but not random feature selection), and from cobbling together small nuggets from posts about LightGBM and XGBoost, it looks like XGBoost and LightGBM work as follows:

Boosted Bagged Trees:

  1. Fit a decision tree to your data
  2. For i in N boosting rounds:
    • Get the residuals
    • if i mod bag_frequency == 0 (e.g., bag every 5 rounds):
      • Take a single random sample of x% of the samples and y% of the features; use this random sample going forward
    • Fit a tree to the residuals
  3. The final prediction is a weighted sum of the sequential predictors (see the sketch below).
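As a point of comparison, scikit-learn's stochastic gradient boosting is the closest off-the-shelf analogue of this shape that I'm aware of. A hedged sketch; note that it re-draws the row sample every round rather than every bag_frequency rounds, and samples features per split rather than per sample:

```python
from sklearn.ensemble import GradientBoostingRegressor

stochastic_gbm = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.1,
    subsample=0.8,      # fraction of rows drawn (without replacement) each boosting round
    max_features=0.6,   # fraction of features considered at each split
)
```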

Please correct my understanding here and fill in the details. Boosted Bagged Trees (with just one random tree per bag_frequency) doesn't seem as powerful as Bagged Boosted Trees.

Best Answer

"Bagging: Take N random samples of x% of the samples and y% of the features"

In bagging, the instances are repeatedly sub-sampled, but not the features (Random Forests, XGBoost, and CatBoost subsample both):

Given dataset D of size N.
For m in n_models:
    Create new dataset D_i of size N by sampling with replacement from D.
    Train model on D_i (and then predict)
Combine predictions with equal weight 

Include an initialization step in your boosting pseudocode to get rid of the redundancy:

Init data with equal weights (1/N).
For m in n_models:
    Train model on weighted data (and then predict)
    Update weights according to misclassification rate.
    Renormalize weights
Combine confidence weighted predictions
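For reference, that weight-based pseudocode is the AdaBoost flavour of boosting. A minimal sketch, assuming binary labels coded as -1/+1 and decision stumps as the weak learner (the helper names are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_models=50):
    """y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # init data with equal weights
    stumps, alphas = [], []
    for _ in range(n_models):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)          # weighted misclassification rate
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))    # confidence of this model
        w *= np.exp(-alpha * y * pred)                     # upweight misclassified points
        w /= w.sum()                                       # renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Combine confidence-weighted predictions."""
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```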

Bagged Boosted Trees (as you call it) is certainly a reasonable approach, but it is different from XGBoost or CatBoost:

Given dataset D of size N.
For m in n_models:
    Create new dataset D_i of size N by sampling with replacement from D.
    (Insert Boosting pseudo code here (on D_i))
Combine predictions with equal weight 

XGBoost and CatBoost are both based on boosting and use the entire training data. They also implement bagging by subsampling once in every boosting iteration:

Init data with equal weights (1/N).
For m in n_models:
    Train model on weighted bootstrap sample (and then predict)
    Update weights according to misclassification rate.
    Renormalize weights
Combine confidence weighted predictions

If you want to stick with "fit the model to the residuals", then this is equivalent to "fit the model to the residuals of the data in the bootstrap sample".
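To make that concrete, here is a hand-rolled sketch of a single boosting chain where each round's tree is fit to the residuals of a fresh row sample only (the names are invented for illustration; the real libraries do this internally and add much more):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def subsampled_boosting_fit(X, y, n_rounds=100, learning_rate=0.1, row_frac=0.8):
    rng = np.random.default_rng(0)
    base = y.mean()
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        rows = rng.choice(len(y), size=int(row_frac * len(y)), replace=True)
        residuals = y[rows] - pred[rows]                   # residuals on the sample only
        tree = DecisionTreeRegressor(max_depth=3).fit(X[rows], residuals)
        pred += learning_rate * tree.predict(X)            # update predictions on all rows
        trees.append(tree)
    return base, trees

def subsampled_boosting_predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```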


Further Remarks:

There is no "best way to do it" as you suggest (no free lunch theorem). "Bagged Boosted Trees" might outperform XGBoost on certain data sets.

"Take a single random sample of x% of the samples"

This line is confusing. Where did you get this from?

"if i mod bag_frequency == 0 (e.g., bag every 5 rounds):"

This should not be mentioned in your pseudocode, especially when other, more important parameters (like the learning rate in boosting) are left out.