Solved – Do boosting techniques use voting like any other ensemble method?

boosting, ensemble learning, machine learning, voting-system

Can we generalize all ensemble methods by using voting? Do boosting methods also use voting to combine the weak learners into the final model?

My understanding of the technique:

  • Boosting: Iteratively adds weak learners, upweighting the data points that were not correctly classified.
  • Ensemble technique: Uses multiple learners to obtain a better prediction than any single one alone. This is explained on Wikipedia.

Best Answer

Boosting can generally be understood as (weighted) voting

In the case of boosting, one of its inventors gives an affirmative answer in this introduction to AdaBoost (emphasis mine):

The final or combined hypothesis $H$ computes the sign of a weighted combination of weak hypotheses $$F(x) = \sum_{t=1}^T\alpha_th_t(x)$$ This is equivalent to saying that $H$ is computed as a weighted majority vote of the weak hypotheses $h_t$ where each is assigned weight $\alpha_t$. (In this chapter, we use the terms “hypothesis” and “classifier” interchangeably.)

So yes, the final model returned is a weighted vote of all the weak learners trained up to that iteration. Likewise, you'll find this snippet on Wikipedia about boosting in general:

While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are typically weighted in some way that is usually related to the weak learners' accuracy.

Also note the mention therein that the original boosting algorithms used a "majority." The notion of voting is pretty firmly baked into boosting: Its guiding principle is to improve an ensemble at each iteration by adding a new voter, then deciding how much weight to give each vote.
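To make the weighted-vote reading concrete, here is a minimal AdaBoost-style sketch in Python (decision stumps from scikit-learn as the weak learners; the function names and details are illustrative assumptions, not taken from the question or the quoted sources). Each round assigns a vote weight $\alpha_t$ based on the weighted error, and the final prediction is the sign of the weighted sum, i.e. the weighted majority vote $H$ described in the quote above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Minimal AdaBoost-style sketch; y is assumed to be labelled in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # distribution over training points
    learners, alphas = [], []
    for _ in range(n_rounds):
        h = DecisionTreeClassifier(max_depth=1)  # a weak learner (decision stump)
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)    # this learner's vote weight
        w *= np.exp(-alpha * y * pred)           # upweight the misclassified points
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(learners, alphas, X):
    # H(x) = sign( sum_t alpha_t * h_t(x) ): a weighted majority vote
    F = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(F)
```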

This same intuition carries for the example of gradient boosting: At each iteration $m$ we find a new learner $h_m$ fitted to pseudo-residuals, then optimize $\gamma_m$ to decide how much weight to give $h_m$'s "vote."
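A hedged sketch of that idea for squared-error loss, where the pseudo-residuals are simply $y - F_m(x)$ and the line search for $\gamma_m$ has a closed form (again, the names and choice of base learner here are my own illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100):
    """Minimal gradient boosting sketch for squared-error loss."""
    f0 = y.mean()                                # F_0: a constant initial model
    F = np.full(len(y), f0)
    learners, gammas = [], []
    for _ in range(n_rounds):
        residuals = y - F                        # pseudo-residuals for squared error
        h = DecisionTreeRegressor(max_depth=3)   # new weak learner h_m
        h.fit(X, residuals)                      # fitted to the pseudo-residuals
        pred = h.predict(X)
        # Line search for gamma_m: how much weight h_m's "vote" gets.
        # For squared error this minimization has a closed form.
        gamma = np.dot(residuals, pred) / (np.dot(pred, pred) + 1e-12)
        F += gamma * pred
        learners.append(h)
        gammas.append(gamma)
    return f0, learners, np.array(gammas)

def gradient_boost_predict(f0, learners, gammas, X):
    # The ensemble's prediction is the weighted sum of the learners' "votes".
    return f0 + sum(g * h.predict(X) for g, h in zip(gammas, learners))
```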

Extending to all ensemble methods runs into counterexamples

As it is, some would find that even the notion of weighting stretches the voting metaphor. When considering whether to extend this intuition to all ensemble learning methods, consider this snippet:

Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner.

And this one on the example ensemble method of stacking:

Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although in practice, a single-layer logistic regression model is often used as the combiner.

If you're defining ensemble methods to include stacking methods with an arbitrary combiner, you can construct methods that, in my view, stretch the notion of voting beyond its limit. It's difficult to see how a collection of weak learners combined via a decision tree or neural network can be viewed as "voting." (Leaving aside the equally difficult question of when such a method might prove practically useful.)
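For contrast, here is what the tame end of stacking looks like: a sketch using scikit-learn's StackingClassifier with a single logistic regression as the combiner, as in the quoted passage (the base learners and dataset here are arbitrary choices for illustration). Swapping the final_estimator for a decision tree or a neural network gives the arbitrary-combiner case that, to my mind, no longer resembles voting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Base learners whose out-of-fold predictions become inputs to the combiner.
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier()),
]

# A single logistic regression as the combiner, as in the quoted description.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print(stack.score(X, y))
```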

Some introductions describe ensembles and voting as synonymous; I'm not familiar enough with the recent literature on these methods to say how the terms are generally applied these days, but I hope this answer gives an idea of how far the notion of voting extends.
