There are many blog posts, YouTube videos, etc. about the ideas of bagging or boosting trees. My general understanding is that the pseudocode for each is:
Bagging:
- Take N random samples of x% of the samples and y% of the features
- Fit your model (e.g., decision tree) on each of N
- Predict with each of the N models
- Average the predictions to get the final prediction
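The bagging steps above can be sketched as follows (a minimal illustration on synthetic data, assuming scikit-learn decision trees; the fractions `row_frac` and `col_frac` play the role of x% and y%):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.1 * rng.normal(size=200)

N, row_frac, col_frac = 10, 0.8, 0.6
models = []
for _ in range(N):
    # Take a random sample of x% of the rows (with replacement)
    # and y% of the features (without replacement)
    rows = rng.choice(len(X), size=int(row_frac * len(X)), replace=True)
    cols = rng.choice(X.shape[1], size=int(col_frac * X.shape[1]), replace=False)
    tree = DecisionTreeRegressor(max_depth=3).fit(X[rows][:, cols], y[rows])
    models.append((tree, cols))

# Final prediction: average the N individual predictions
pred = np.mean([tree.predict(X[:, cols]) for tree, cols in models], axis=0)
```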
Boosting:
- Fit your model (e.g., decision tree) to your data
- Get the residuals
- Fit your model to the residuals
- Repeat the previous two steps for N boosting rounds
- The final prediction is a weighted sum of the sequential predictors.
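For squared-error loss, the boosting steps above reduce to repeatedly fitting the residuals; a minimal sketch (assuming scikit-learn trees, with an explicit initialization and a learning rate, both standard in gradient boosting):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)

learning_rate, n_rounds = 0.1, 50
pred = np.full(len(y), y.mean())  # initialization: constant prediction
trees = []
for _ in range(n_rounds):
    residuals = y - pred                            # get the residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)         # weighted sum of predictors
    trees.append(tree)
```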
I'll take any clarifications to my understanding above, but my intended question is as follows:
Both XGBoost and LightGBM have params that allow for bagging. The application is not Bagging OR Boosting (which is what every blog post talks about), but Bagging AND Boosting. What is the pseudo code for where and when the combined bagging and boosting takes place?
I expected it to be "Bagged Boosted Trees", but it seems it is "Boosted Bagged Trees". The difference seems substantial.
Bagged Boosted Trees:
- Take N random samples of x% of the samples and y% of the features
- Fit Boosted trees on each of the N samples
- Predict with each of the N boosted models
- Average the predictions to get the final prediction
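A sketch of "Bagged Boosted Trees" as described above, assuming scikit-learn's `GradientBoostingRegressor` as the boosted model inside each bag (synthetic data; `row_frac`/`col_frac` stand in for x%/y%):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=300)

N, row_frac, col_frac = 5, 0.8, 0.5
bag = []
for _ in range(N):
    # One row/feature subsample per bagged member
    rows = rng.choice(len(X), size=int(row_frac * len(X)), replace=True)
    cols = rng.choice(X.shape[1], size=int(col_frac * X.shape[1]), replace=False)
    # Fit an entire boosted model on this subsample
    gbm = GradientBoostingRegressor(n_estimators=50, random_state=0)
    gbm.fit(X[rows][:, cols], y[rows])
    bag.append((gbm, cols))

# Average the N boosted models' predictions
pred = np.mean([gbm.predict(X[:, cols]) for gbm, cols in bag], axis=0)
```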
This seems like the best way to do it. After all, the risk in boosting is overfitting and the primary benefit of bagging is to reduce overfitting; bagging a bunch of boosted models seems like a great idea.
However, from looking through, for example, the scikit-learn
gradient_boosting.py source (which implements row subsampling, but not per-tree random feature selection), and cobbling together small nuggets from posts about LightGBM and XGBoost, it looks like XGBoost and LightGBM work as follows:
Boosted Bagged Trees:
- Fit a decision tree to your data
- For i in N boosting rounds:
  - Get the residuals
  - If i mod bag_frequency == 0 (i.e., re-bag every bag_frequency rounds, e.g., every 5):
    - Take a single random sample of x% of the samples and y% of the features; use this random sample going forward
  - Fit a tree to the residuals
- The final prediction is a weighted sum of the sequential predictors.
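The per-round subsampling described above is stochastic gradient boosting; a minimal sketch (assuming scikit-learn trees): each boosting round draws a fresh row subsample, the tree is fit to that subsample's residuals, and the update is applied to the predictions for all rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

learning_rate, n_rounds, row_frac = 0.1, 50, 0.5
pred = np.full(len(y), y.mean())  # initialization: constant prediction
for _ in range(n_rounds):
    residuals = y - pred
    # Fresh subsample each round; it is used only for this round's fit
    rows = rng.choice(len(X), size=int(row_frac * len(X)), replace=False)
    tree = DecisionTreeRegressor(max_depth=3).fit(X[rows], residuals[rows])
    pred += learning_rate * tree.predict(X)  # update uses the full data
```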
Please correct my understanding here and fill in the details. Boosted Bagged Trees (with just one random subsample per bag_frequency rounds) doesn't seem as powerful as Bagged Boosted Trees.
Best Answer
In classic bagging, instances are repeatedly subsampled, but features are not. (Random Forests, XGBoost and CatBoost subsample both.)
Include an initialization step (e.g., start from a constant prediction) in your boosting pseudocode to get rid of the redundancy: the first fit then becomes just another boosting round instead of a special case.
Bagged Boosted Trees (as you call it) is certainly a reasonable approach, but different from XGBoost or CatBoost.
XGBoost and CatBoost are both based on boosting and use the entire training data. They also implement bagging by subsampling once in every boosting iteration.
If you want to stick to "fit model to residuals", then this would be equivalent to "fit model to residuals of data in bootstrap sample".
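As a concrete illustration (not part of the original answer), these are the library parameters that switch on this per-iteration subsampling; the names are taken from the XGBoost and LightGBM parameter documentation:

```python
# XGBoost: subsampling happens every boosting round
xgb_params = {
    "subsample": 0.8,         # fraction of rows drawn each boosting round
    "colsample_bytree": 0.8,  # fraction of features drawn per tree
}

# LightGBM: bagging_freq controls how often the row sample is redrawn
lgb_params = {
    "bagging_fraction": 0.8,  # fraction of rows per bag
    "bagging_freq": 1,        # re-bag every k-th iteration (0 disables bagging)
    "feature_fraction": 0.8,  # fraction of features per tree
}
```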
Further Remarks:
There is no "best way to do it" as you suggest (no free lunch theorem). "Bagged Boosted Trees" might outperform XGBoost on certain data sets.
Your "use this random sample going forward" line is confusing. Where did you get this from? The subsample is used only for the current iteration's fit, not for all subsequent rounds.
The bag_frequency detail should not be mentioned in your pseudocode, especially when other, more important parameters are left out (like the learning rate in boosting).