Machine Learning – Comprehensive Guide to Understanding Gradient Boosting

boosting, classification, ensemble learning, machine learning

At a high level, I don't see how the ensemble of simple models obtained by gradient boosting is better than a single, more complicated model. What's the point of doing gradient boosting instead of fitting a single, more complicated model?
Two specific scenarios below:

  1. In an article I read (Gradient boosting from scratch), there is an example of an ensemble of simple trees (stumps) for regression. So why is the approach with gradient boosting better than a single, more complicated tree of greater depth?

  2. a) In the case of linear regression, it seems like it doesn't make sense to use gradient boosting. Can somebody explain why (or rebut)? It would help my understanding of both regression and boosting. For example, instead of doing a regression on many features (perhaps even a special variant, like LASSO), one could do successive iterations of single-feature regressions and combine them via gradient boosting.

    b) Same as 2a, only for logistic regression. I suspect it may make sense here, because the fitted function is not linear. But why would one apply gradient boosting with logistic regressions instead of, for example, regularized logistic regression?

Best Answer

I've answered question 2a on this site before.

The answer to 2b is essentially the same. When gradient boosting is used for classification, the base learners are fit not to the gradient of the predicted probabilities, but to the gradient of the loss with respect to the predicted log-odds. On the log-odds scale the model is additive, just as in regression, so 2b reduces to 2a in principle.
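
To make that reduction concrete, here is a minimal sketch (not from the original answer; plain NumPy on made-up synthetic data) of gradient boosting under squared-error loss whose weak learners are single-feature linear regressions, as in 2a. Because a sum of linear functions is still linear, the boosted "ensemble" is just another linear model, and with enough small steps it simply recovers the ordinary least-squares fit; stopping early only gives a shrunken, regularized version of it.

```python
import numpy as np

# Sketch: gradient boosting on squared loss, with single-feature linear
# regressions as the weak learners (synthetic data, for illustration only).
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

learning_rate = 0.1
coef = np.zeros(p)   # accumulated coefficients of the boosted model
pred = np.zeros(n)

for _ in range(2000):
    residual = y - pred                       # negative gradient of squared loss
    scores = X.T @ residual                   # fit of each single-feature learner
    j = np.argmax(np.abs(scores))             # pick the most useful feature
    beta_j = scores[j] / (X[:, j] @ X[:, j])  # its least-squares coefficient
    coef[j] += learning_rate * beta_j         # the ensemble stays linear
    pred += learning_rate * beta_j * X[:, j]

ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("boosted coefficients:", np.round(coef, 3))
print("OLS coefficients:    ", np.round(ols, 3))
```

The two printouts come out essentially identical: boosting linear learners can only produce another linear model, so there is nothing to gain over fitting the linear (or regularized) regression directly.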

As for 1:

Here there is an example of an ensemble of simple trees (stumps) for regression. So why is the approach with gradient boosting better than a single, more complicated tree of greater depth?

The power of gradient boosting is that it allows us to build predictive functions of great complexity. The trouble with building predictive functions of great complexity lies in the bias-variance tradeoff: very large complexity means very low bias, which unfortunately is wed to very high variance.

If you fit a complex model in one go (a deep decision tree, for example), you have done nothing to deal with this variance explosion, and you will find that your test error is very poor.

Boosting is essentially a principled way of carefully controlling the variance of a model when attempting to build a complex predictive function. The main idea is that we should build the predictive function very slowly, and constantly check our work to see if we should stop building. This is why using a small learning rate and weak individual learners is so important to using boosting effectively. These choices let us layer on complexity very slowly and apply a lot of care to constructing our predictive function, and they give us many places to stop, by monitoring the test error at each stage of the construction.
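
As a concrete illustration of "building slowly and checking our work" (not from the original answer; it assumes scikit-learn and its synthetic Friedman #1 benchmark), here is a minimal sketch using depth-1 stumps, a small learning rate, and staged predictions to monitor the test error after every stage:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=1200, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=2000,    # plenty of stages; we will pick the best one
    learning_rate=0.05,   # small steps: complexity is layered on slowly
    max_depth=1,          # weak individual learners (stumps)
    random_state=0,
).fit(X_train, y_train)

# Test error after 1, 2, ..., 2000 stages: "checking our work" as we build.
test_mse = [mean_squared_error(y_test, pred)
            for pred in model.staged_predict(X_test)]
best_stage = int(np.argmin(test_mse)) + 1
print(f"best test MSE {min(test_mse):.3f} at stage {best_stage}")
```

In practice you would monitor a held-out validation set (or use early stopping, e.g. scikit-learn's `n_iter_no_change`) rather than peeking at the test set, but the point stands: every stage is a place where we can check the error and stop.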

If you do not do this, your boosted model will be poor, often as poor as a single decision tree. Try setting the learning rate to $1.0$ in a gradient boosted model, or using very deep trees as individual learners.
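
Continuing the sketch above (same imports and train/test split), the configuration this paragraph warns against can be compared directly; with no shrinkage and deep trees, the training error collapses to near zero while the test error typically comes out far worse:

```python
# The "greedy" variant: no shrinkage, strong individual learners.
greedy = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=1.0,   # full-size steps
    max_depth=8,         # deep, high-variance trees
    random_state=0,
).fit(X_train, y_train)

print("careful boosting, best test MSE:", round(min(test_mse), 3))
print("greedy boosting, test MSE:      ",
      round(mean_squared_error(y_test, greedy.predict(X_test)), 3))
```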
