Ensemble Learning – How to Combine Weak Classifiers to Create a Strong Model

classification, ensemble-learning, machine-learning

Let us assume that we have a binary classification problem and several classifiers. Instead of assigning a vector to a class (0 or 1), each classifier returns the probability that a given vector belongs to class 1. This means that for each input vector that has to be classified we get a vector of real numbers between 0 and 1. For example:

(0.81, 0.67, 0.43, 0.99, 0.53)

where the number of components (probabilities) is equal to the number of classifiers. Now we want to "combine" these "weak" classifiers to get one "strong" classifier. In other words, we need to find a way to map a given vector of probabilities to one number (a probability).
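As an illustration, the simplest such mapping, an unweighted average, would send the example vector above to

$$\frac{0.81 + 0.67 + 0.43 + 0.99 + 0.53}{5} = 0.686,$$

but it is not obvious that this is the best choice.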

So, my question is: what is the "correct" way to do it? Of course, I could train another classifier that takes the vector of probabilities and returns one probability; in other words, we could learn how to combine the "weak" probabilities empirically. However, I assume we can exploit the fact that the components of the vector are not just "some numbers" (or features): they are probabilities, they are already predictions, and, as a consequence, they should be combined in a correspondingly appropriate way.

ADDED

In the comments it has been proposed to average the "weak" probabilities. But if it is possible to estimate the quality (power) of each "weak" classifier (and it should be possible), doesn't it make sense to suppress "bad" classifiers (for example, by giving their predictions (probabilities) smaller weights or by ignoring them completely)? Does it make sense to use just one (the best) weak classifier? Does it make sense to check the correlation between the weak classifiers? For example, what should we do if two "weak" classifiers always give the same result? Shouldn't we throw one of them out as not adding any value?
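As a concrete illustration of that last point, redundancy between weak classifiers could be checked with something as simple as a correlation matrix of their predicted probabilities. A minimal sketch (the `proba` array below is purely hypothetical stand-in data):

```python
import numpy as np

rng = np.random.default_rng(0)
proba = rng.uniform(size=(5, 100))   # stand-in: 5 classifiers' probabilities on 100 examples
proba[1] = proba[0]                  # make two classifiers identical on purpose

corr = np.corrcoef(proba)            # 5 x 5 correlation matrix between the classifiers
print(np.round(corr, 2))             # entries near 1.0 flag near-duplicate classifiers
```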

Best Answer

Practically speaking, bagging, boosting, and stacking all constitute reasonable ways to combine the weak predictions, as others have mentioned. Taking that one step further, trying as many of them as possible and seeing what performs best is common in the context of competitive machine learning too (e.g. you might try a stacked classifier as well as a simple average ensemble and choose the better of the two based on performance on some hold-out data).
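For instance, here is a minimal sketch of that kind of comparison, assuming scikit-learn-style base classifiers and a synthetic dataset purely for illustration:

```python
# Compare a stacked ensemble against a simple soft-vote (probability-averaging)
# ensemble on hold-out data and keep whichever scores better.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

base = [("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("nb", GaussianNB())]

stack = StackingClassifier(estimators=base, final_estimator=LogisticRegression(max_iter=1000))
average = VotingClassifier(estimators=base, voting="soft")   # plain mean of predicted probabilities

for name, model in [("stacking", stack), ("soft-vote average", average)]:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
    print(f"{name}: hold-out AUC = {auc:.3f}")
```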

In fact, if accuracy is your only goal then you're likely better off having a few different ensembling techniques in mind like this and focusing on the diversity of the classifiers being ensembled rather than on how they're combined. Since the best ensemble technique will depend on your problem and data, and there is no approach that's always best, your time is better spent making the ensemble choice another part of the training process rather than worrying about making a single correct choice.

Theoretically speaking, though, and as an attempt to answer the question more directly, I think there are some probabilistic arguments that could be used to justify the best way to "suppress" the weak classifiers. IMO the reasoning behind Bayesian model averaging and information-criterion-based averaging is pretty enlightening and has ties to some approaches in machine learning, like weighting classifiers via binomial deviance. For example, here's a process for combining classifiers through the use of Akaike weights (as an example of information-criterion-based model averaging):

**Note**: this all assumes the classifiers are reasonably well calibrated and that actual out-of-sample deviance (from cross-validation) is used rather than an estimate of it like AIC.

Suppose you have $K$ classifiers fit to training data, and each of those classifiers then makes $N$ predictions on test data. You could then compute the "likelihood" of each model's predictions as follows:

$$ L_k = \prod_{i=1}^{N}{ P_{M_k}(y_i)}$$

where:

  • $L_k$ = Likelihood of predictions from model $k$
  • $y_i$ = Response $i$ in the test data
  • $P_{M_k}(y_i)$ = The probability that classifier $k$ attributes to $y_i$
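To make this concrete, here is a small Python sketch that computes each model's (log-)likelihood from a hypothetical array `proba` of shape $(K, N)$ holding each classifier's predicted probability of class 1 for each test example; working in log space avoids numerical underflow for large $N$:

```python
import numpy as np

def model_log_likelihoods(proba, y_test):
    """Log-likelihood log(L_k) of each model's test predictions.

    proba[k, i] is (assumed to be) classifier k's predicted probability that
    test example i belongs to class 1; y_test holds the true 0/1 labels.
    """
    proba = np.clip(proba, 1e-12, 1 - 1e-12)              # guard against log(0)
    p_observed = np.where(y_test == 1, proba, 1 - proba)  # P_{M_k}(y_i)
    return np.log(p_observed).sum(axis=1)                 # one value per model
```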

Given the likelihood of the predicted data, the weight or relative likelihood of each classifier, $w_{M_k}$, can then be defined in terms of the differences $\Delta_k = 2\log(L_{max}/L_k)$, where $L_{max} = \max_k{L_k}$: $$w_{M_k} = \frac{e^{-\Delta_k/2}}{\sum_{j=1}^{K}{e^{-\Delta_j/2}}}$$
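In code, continuing from the log-likelihoods above, these Akaike-style weights can be formed entirely in log space:

```python
import numpy as np

def akaike_weights(log_lik):
    """Akaike-style model weights from per-model log-likelihoods."""
    delta = 2.0 * (log_lik.max() - log_lik)   # Delta_k = 2 * log(L_max / L_k) >= 0
    raw = np.exp(-delta / 2.0)                # relative likelihood of each model
    return raw / raw.sum()                    # normalize so the weights sum to 1
```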

The weight for each model, $w_{M_k}$, can then be interpreted as the probability that classifier $M_k$ is the true model, and the prediction from an ensemble of all $K$ classifiers is then the expected value over models, i.e. the sum of each classifier's probability times its prediction:

$$ y_{new} = \sum_{k=1}^{K} { w_{M_k} \cdot M_k(X_{new}) }.$$
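Completing the sketch, the ensemble prediction is just the weight-averaged probability from all $K$ classifiers on the new data:

```python
def ensemble_predict(proba_new, weights):
    """Weighted-average class-1 probability for each new example.

    proba_new[k, j] is (assumed to be) classifier k's predicted probability of
    class 1 for new example j; weights come from akaike_weights above.
    """
    return weights @ proba_new                # y_new[j] = sum_k w_k * proba_new[k, j]
```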

I hope the notation doesn't bog you down, but the point is that there are theoretical ways (the Bayesian ones are especially interesting) to determine what probability (weight) each model in an ensemble should have and then use that to make predictions, rather than some more heuristic weighting or equal-voting scheme. These theoretically motivated schemes don't generally perform better than the more intuitive averaging strategies in empirical studies (or so it seems), but I thought throwing the notion of them into the mix might help you like it helped me.