Boosting – Why is Boosting Less Likely to Overfit?

adaboost, boosting, overfitting

I've been learning about machine-learning boosting methods (e.g., AdaBoost, gradient boosting), and the sources I've read say that boosted tree methods are less likely to overfit than other machine-learning methods. Why would that be the case?

Since boosting upweights the observations that were predicted incorrectly, it seems like it could easily end up fitting the noise and overfitting the data, so I must be misunderstanding something.

Best Answer

The general idea is that each individual tree will overfit some parts of the data but will, as a consequence, underfit other parts. In boosting, though, you don't use the individual trees on their own; you "average" them all together, so for a particular data point (or group of points) the trees that overfit that point (those points) are averaged with the trees that underfit it, and the combined average should neither overfit nor underfit but be about right.
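To see the effect of that combining concretely, here is a minimal sketch (assuming scikit-learn is available; the dataset is simulated and purely illustrative, not from the question). AdaBoost's default base learner is a depth-1 tree (a "stump"), and `staged_score` reports the accuracy of the combined ensemble after each additional stump, so you can watch what happens to the test accuracy as more and more trees are folded into the average:

```python
# Minimal sketch (scikit-learn assumed; data are simulated, not from the post).
# AdaBoost's default base learner is a depth-1 tree ("stump"), which badly
# underfits on its own; the ensemble combines hundreds of them.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Simulated data with some label noise (flip_y) that a model could chase.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# staged_score yields the accuracy of the combined ensemble after 1, 2, ... trees.
train_curve = list(boost.staged_score(X_tr, y_tr))
test_curve = list(boost.staged_score(X_te, y_te))
for i in (1, 10, 100, 500):
    print(f"{i:4d} trees  train={train_curve[i - 1]:.3f}  test={test_curve[i - 1]:.3f}")
```

In runs like this the training accuracy typically climbs well above the test accuracy, but the test accuracy tends to stay roughly flat (or keep improving) rather than collapse as trees are added, which is the "about right" behaviour described above.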

As with all models, you should try this out on some simulated data to help yourself understand what is going on. Also, as with all models, you should look at diagnostics and use your subject-matter knowledge and common sense to make sure the model represents your data reasonably.
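As a sketch of such a simulation (again assuming scikit-learn; the dataset and settings are made up for illustration), you can compare a single deep tree, which is free to memorise the label noise, with a boosted ensemble of stumps fit to the same data:

```python
# Simulation sketch (scikit-learn assumed; data and settings are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy simulated data: roughly 15% of the labels are flipped.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.15,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# A single unpruned tree can fit the training labels, noise included.
deep_tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
# Boosted stumps (AdaBoost's default base learner) combine many weak fits.
boosted = AdaBoostClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)

for name, model in (("single deep tree", deep_tree), ("boosted stumps", boosted)):
    print(f"{name:17s} train={model.score(X_tr, y_tr):.3f} "
          f"test={model.score(X_te, y_te):.3f}")
```

On data like this the single tree usually reaches near-perfect training accuracy with a noticeably lower test score, while the boosted ensemble shows a smaller train/test gap; the exact numbers will vary with the noise level and the random seed, and varying those is itself a useful way to build intuition.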
