Machine Learning – Boosting Reduces Bias Compared to Which Algorithm?

adaboost, bagging, boosting, bootstrap, machine learning

I am reading about bagging and boosting, and I understand how they both work (at least I think I do). I would like to frame the question in the context of decision tree ensembles, since I believe (though I am not sure) that trees are the most commonly used base learners for bagging and boosting.

It is said that bagging reduces variance and boosting reduces bias.

Now, I understand why bagging would reduce the variance of a decision tree algorithm: on their own, decision trees are low-bias, high-variance, and when we build a bagged ensemble of them, we reduce the variance because we now spread the vote (classification) or average the predictions (regression) over all of them. (Please point out if this is incorrect.)
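To make sure I have this straight, here is a minimal sketch I put together (scikit-learn on a synthetic dataset of my own choosing, so the exact numbers are only illustrative): a single unpruned tree versus a vote over many bootstrap-trained copies of it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)   # deep tree: low bias, high variance
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),             # same base learner, 200 bootstrap-trained copies
    n_estimators=200,
    random_state=0,
)

print("single deep tree CV accuracy :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged deep trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```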

But I don't understand why it is said that boosting reduces bias. Exactly whose bias does it reduce? If we compare it to a single decision tree, then surely the tree can have zero bias, since a fully grown tree can classify every training point correctly and reach zero error; so how does boosting reduce the bias? If we compare it to a bagging algorithm, then boosting does indeed have less bias, since each boosting iteration uses the whole dataset and also focuses on the data points that were misclassified so far.
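This is the kind of thing I have in mind when I say a tree can reach zero error (again scikit-learn on made-up data, purely illustrative): a fully grown tree can memorise the training set, so its training error is zero.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier()  # no depth limit: the tree grows until its leaves are pure
print("training accuracy of a fully grown tree:", tree.fit(X, y).score(X, y))  # typically 1.0
```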

So when we say that boosting reduces bias, do we say this when we compare it to the bagging algorithm or to something else?

Best Answer

It is said that bagging reduces variance and boosting reduces bias.

Indeed, and in both cases the comparison is with the base learners that the ensembling method employs.

For bagging and random forests, deep/large trees are generally employed as base learners. Large trees have high variance, but low bias. Ensembling many large trees reduces the variance.
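A rough way to see the variance reduction directly (plain NumPy plus scikit-learn trees on a synthetic regression problem; the setup here is only illustrative): fit many unpruned trees on bootstrap resamples and compare how much the individual trees disagree with how accurate their average is.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)   # noisy sine curve
x_grid = np.linspace(-3, 3, 200).reshape(-1, 1)

# Fit 100 unpruned (large) trees, each on its own bootstrap resample.
preds = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(random_state=0)        # fully grown: low bias, high variance
    preds.append(tree.fit(X[idx], y[idx]).predict(x_grid))
preds = np.asarray(preds)

# Individual trees disagree a lot at each grid point; averaging them (the bagged
# prediction) smooths out that disagreement, which is the variance reduction.
print("mean variance of single-tree predictions:", preds.var(axis=0).mean())
print("MSE of one single tree vs true function :",
      np.mean((preds[0] - np.sin(x_grid[:, 0])) ** 2))
print("MSE of bagged prediction vs true function:",
      np.mean((preds.mean(axis=0) - np.sin(x_grid[:, 0])) ** 2))
```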

Boosting is most effective with 'weak learners': base learners that perform only slightly better than chance. Small trees generally work best; stumps (i.e., single-split trees) are often used with boosting. Small trees have low variance but high bias. Combining many such trees (while updating the response variable after fitting each tree, which puts more weight on the training observations not well predicted thus far) therefore reduces the bias.
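A minimal sketch of this bias reduction (scikit-learn with AdaBoost on synthetic data; the dataset and numbers are only illustrative): a single stump underfits badly, while boosting many stumps, each reweighted toward the points the previous ones got wrong, fits the data far more closely.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

stump = DecisionTreeClassifier(max_depth=1, random_state=0)    # weak learner: high bias, low variance
boosted = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1, random_state=0),       # boost 200 stumps sequentially
    n_estimators=200,
    random_state=0,
)

print("single stump train accuracy  :", stump.fit(X, y).score(X, y))
print("boosted stumps train accuracy:", boosted.fit(X, y).score(X, y))
print("single stump CV accuracy     :", cross_val_score(stump, X, y, cv=5).mean())
print("boosted stumps CV accuracy   :", cross_val_score(boosted, X, y, cv=5).mean())
```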