Bagging:
Take N random samples of x% of the samples and y% of the Features
Instances are repeatedly sub-sampled in bagging, but not features. (Random Forests, XGBoost and CatBoost do both):
Given dataset D of size N.
For m in n_models:
Create new dataset D_i of size N by sampling with replacement from D.
Train model on D_i (and then predict)
Combine predictions with equal weight
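The bagging loop above can be sketched in Python. The base learner here (a mean predictor) is a deliberately trivial stand-in of my own choosing, not any particular tree implementation — the point is only the bootstrap-then-average structure:

```python
import random

def bagging_fit(D, n_models, base_fit):
    """Bagging: train each model on a bootstrap sample of D
    (sampled with replacement, same size N as D)."""
    models = []
    for _ in range(n_models):
        D_i = random.choices(D, k=len(D))  # bootstrap sample of size N
        models.append(base_fit(D_i))
    return models

def bagging_predict(models, x):
    """Combine predictions with equal weight (plain average)."""
    return sum(m(x) for m in models) / len(models)

# Toy base learner (an assumption for illustration): predict the
# mean target of its training sample, ignoring the input.
def mean_learner(D_i):
    mean_y = sum(y for _, y in D_i) / len(D_i)
    return lambda x: mean_y

random.seed(0)
D = [(x, 2.0 * x) for x in range(10)]  # toy regression data
models = bagging_fit(D, n_models=25, base_fit=mean_learner)
print(bagging_predict(models, 3))
```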
Include an initialization step in your boosting pseudocode to get rid of redundancy:
Init data with equal weights (1/N).
For m in n_models:
Train model on weighted data (and then predict)
Update weights according to misclassification rate.
Renormalize weights
Combine confidence weighted predictions
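Instantiated as a minimal AdaBoost-style sketch with 1-D decision stumps (the names and the stump learner are my own illustration; a real implementation would also expose a learning rate, see below):

```python
import math

def stump_fit(X, y, w):
    """Weighted decision stump on 1-D inputs: pick the threshold and
    sign with the lowest weighted misclassification rate."""
    best = None
    for t in X:
        for sign in (1, -1):
            pred = [sign if x >= t else -sign for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return err, (lambda x: sign if x >= t else -sign)

def adaboost(X, y, n_models):
    n = len(X)
    w = [1.0 / n] * n                        # init: equal weights 1/N
    ensemble = []
    for _ in range(n_models):
        err, h = stump_fit(X, y, w)          # train on weighted data
        err = max(err, 1e-10)                # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # confidence weight
        # update weights according to misclassification, then renormalize
        w = [wi * math.exp(-alpha * yi * h(xi))
             for wi, xi, yi in zip(w, X, y)]
        Z = sum(w)
        w = [wi / Z for wi in w]
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, x):
    """Combine confidence-weighted predictions."""
    s = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if s >= 0 else -1

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [-1, -1, -1, 1, 1, 1]
ens = adaboost(X, y, n_models=5)
print([predict(ens, x) for x in X])
```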
"Bagged Boosted Trees" (as you call it) is certainly a reasonable approach, but it is different from XGBoost and CatBoost:
Given dataset D of size N.
For m in n_models:
Create new dataset D_i of size N by sampling with replacement from D.
(Insert boosting pseudocode here, applied to D_i)
Combine predictions with equal weight
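Putting the two together, a "Bagged Boosted Trees" sketch could look like the following. The inner boost routine is a deliberately tiny fit-to-residuals stand-in (constants instead of trees, fixed learning rate of my own choosing) — it only shows the structure of boosting inside an outer bagging loop:

```python
import random

def boost(D_i, n_rounds, lr=0.5):
    """Tiny boosting routine (fit-to-residuals flavour): each round
    fits a constant to the current residuals -- a stand-in for a tree."""
    y = [yi for _, yi in D_i]
    pred = [0.0] * len(D_i)
    total = 0.0
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        step = lr * sum(residuals) / len(residuals)
        pred = [pi + step for pi in pred]
        total += step
    return total                    # constant model: overall offset

def bagged_boosting(D, n_models, n_rounds):
    models = []
    for _ in range(n_models):
        D_i = random.choices(D, k=len(D))    # bootstrap sample of size N
        models.append(boost(D_i, n_rounds))  # boosting inside the bag
    return models

random.seed(1)
D = [(x, 5.0) for x in range(20)]   # toy data with constant target 5
models = bagged_boosting(D, n_models=10, n_rounds=20)
print(sum(models) / len(models))    # equal-weight combination
```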
XGBoost and CatBoost are both based on boosting and use the entire training data. They additionally implement bagging by drawing a subsample once in every boosting iteration:
Init data with equal weights (1/N).
For m in n_models:
Train model on weighted bootstrap sample (and then predict)
Update weights according to misclassification rate.
Renormalize weights
Combine confidence weighted predictions
If you want to stick to "fit model to residuals", then this would be equivalent to "fit model to residuals of data in bootstrap sample".
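A sketch of that idea — boosting over the full training data while each round's weak learner is fitted only to the residuals of a fresh random subsample (again with a constant as a stand-in for the tree, and a subsample fraction and learning rate of my own choosing):

```python
import random

def boosting_with_subsampling(D, n_rounds, subsample=0.5, lr=0.3):
    """Boosting over the full data, but each round the weak learner is
    fitted only to the residuals of a fresh random subsample -- the
    'fit model to residuals of data in bootstrap sample' idea."""
    y = [yi for _, yi in D]
    pred = [0.0] * len(D)
    total = 0.0
    k = max(1, int(subsample * len(D)))
    for _ in range(n_rounds):
        idx = random.sample(range(len(D)), k)  # fresh subsample each round
        residuals = [y[i] - pred[i] for i in idx]
        step = lr * sum(residuals) / len(residuals)  # constant weak learner
        total += step
        pred = [pi + step for pi in pred]      # predictions updated on ALL data
    return total

random.seed(2)
D = [(x, 3.0) for x in range(40)]   # constant toy target
model = boosting_with_subsampling(D, n_rounds=50)
print(model)
```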
Further Remarks:
There is no "best way to do it" as you suggest (no free lunch theorem). "Bagged Boosted Trees" might outperform XGBoost on certain data sets.
Take a single random sample of x% of the samples
This line is confusing. Where did you get this from?
if i mod bag_frequency == 0 (i.e., bag every 5 rounds):
This should not be mentioned in your pseudocode. Especially when there are other more important parameters left out (like the learning rate in boosting).
Best Answer
The reason for using 'stumps' in boosting but full-height trees in random forests is to do with how the aggregation and fitting is done.
In random forests, the trees in the ensemble are fitted independently to independent bootstrap samples, so any error caused by growing the trees too far is independent for each tree and tends to cancel out in the ensemble average.
In boosting, the trees are fitted sequentially, with each one trained on (in some sense) the residuals from the previous classifier. Once a boosted ensemble starts overfitting, it will keep overfitting; the errors won't just cancel out.
For this reason, it's worth having individual trees be short when boosting and tall when bagging. It's not clear that 'stumps' are optimal for boosting -- there are recommendations for trees with, say, 6 leaves to include interactions better -- but that's an explanation for the basic idea.
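The "errors cancel out in the ensemble average" point can be illustrated with a toy simulation (my own construction, not any real forest): unbiased but high-variance predictions, standing in for overgrown trees, have a much smaller spread once averaged over an ensemble:

```python
import random
import statistics

random.seed(0)
truth = 10.0

def noisy_prediction():
    # stand-in for one overgrown tree: unbiased but high-variance
    return truth + random.gauss(0, 2.0)

# spread of single trees vs. equal-weight averages of 100 trees
single = [noisy_prediction() for _ in range(1000)]
ensembles = [statistics.mean(noisy_prediction() for _ in range(100))
             for _ in range(1000)]

print(statistics.stdev(single))     # close to the per-tree sigma of 2.0
print(statistics.stdev(ensembles))  # roughly 2.0 / sqrt(100)
```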