Practically speaking, bagging, boosting, and stacking are all reasonable ways to combine the weak predictions, as others have mentioned. Taking that one step further, though, trying as many of them as possible and seeing what performs best is also common in competitive machine learning (e.g. you might try a stacked classifier as well as a simple averaging ensemble and choose the better of the two based on performance on some hold-out data).
In fact, if accuracy is your only goal, then you're likely better off keeping a few different ensembling techniques in mind like this and focusing on the diversity of the classifiers being ensembled rather than on how they're combined. Since the best ensemble technique depends on your problem and data, and no approach is always best, your time is better spent making the ensemble choice another part of the training process rather than worrying about making a single correct choice.
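As a rough illustration of treating the ensemble choice as part of training, here is a minimal sketch that scores each candidate ensemble's hold-out predictions by log loss and keeps the best one. The function names (`log_loss`, `pick_best_ensemble`) and the numbers are made up for illustration; in practice the predicted probabilities would come from your own fitted ensembles.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss; p_pred[i] is the predicted P(y_i = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def pick_best_ensemble(candidates, y_holdout):
    """candidates maps an ensemble name to its hold-out predictions."""
    scores = {name: log_loss(y_holdout, preds) for name, preds in candidates.items()}
    best = min(scores, key=scores.get)  # lower log loss is better
    return best, scores

# Toy, made-up hold-out labels and predictions (assumption, not real results).
y_holdout = [1, 0, 1, 1, 0]
candidates = {
    "simple_average": [0.7, 0.4, 0.6, 0.8, 0.3],
    "stacked": [0.9, 0.2, 0.55, 0.7, 0.4],
}
best_name, scores = pick_best_ensemble(candidates, y_holdout)
print(best_name, scores)
```

Any other hold-out metric (accuracy, AUC) could be swapped in for log loss; the point is only that the combiner is chosen empirically, the same way a hyperparameter would be.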
Theoretically speaking though, and as an attempt to answer the question more directly, I think there are some probabilistic arguments that could be used to justify the best way to "suppress" the weak classifiers. IMO the reasoning behind Bayesian Model Averaging and information-criterion-based averaging is pretty enlightening and has ties to some of the approaches in machine learning, like weighting classifiers via binomial deviance. For example, here's a process for combining classifiers through the use of Akaike weights (as an example of information-criterion-based model averaging):
** Note: this all assumes the classifiers are reasonably well calibrated and that actual out-of-sample deviance is used in cross-validation, rather than an estimate of it like AIC.
Suppose you have $K$ classifiers fit to training data and each of those classifiers then makes $N$ predictions on test data. You could then compute the "likelihood" of each model's predictions as follows:
$$ L_k = \prod_{i=1}^{N}{ P_{M_k}(y_i)}$$
where:
- $L_k$ = Likelihood of predictions from model $k$
- $y_i$ = Response $i$ in the test data
- $P_{M_k}(y_i)$ = The probability that classifier $k$ attributes to $y_i$
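Given these likelihoods, the weight (or relative likelihood) of each classifier, $w_{M_k}$, could then be defined as:
$$L_{max} = \max_k{L_k}, \qquad \Delta_k = 2\log(L_{max}/L_k)$$
$$w_{M_k} = \frac{e^{-\frac{1}{2}\Delta_k}}{\sum_{j=1}^{K}{e^{-\frac{1}{2}\Delta_j}}}$$
Here $\Delta_k \ge 0$ is the deviance difference between model $k$ and the best model, so the best model gets the largest weight and the expression simplifies to $w_{M_k} = L_k / \sum_{j=1}^{K} L_j$.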
The weights for each model, $w_{M_k}$, can then be interpreted as the probability that classifier $M_k$ is the true model, and a prediction from an ensemble of all $K$ classifiers would have an expected value equal to the sum of each classifier's probability times its prediction:
$$ y_{new} = \sum_{k=1}^{K} { w_{M_k} \cdot M_k(X_{new}) }.$$
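The process above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it uses the standard Akaike-weight form, $w \propto e^{-\Delta/2}$ with $\Delta_k = 2\log(L_{max}/L_k)$, works in log space because the raw product of probabilities underflows for large $N$, and the function names and toy probabilities are made up for illustration.

```python
import math

def akaike_style_weights(probs_of_truth, eps=1e-15):
    """probs_of_truth[k][i] = probability model k gave to the observed y_i."""
    # Log-likelihood of each model's predictions (clipped to avoid log(0)).
    log_liks = [sum(math.log(max(p, eps)) for p in row) for row in probs_of_truth]
    max_ll = max(log_liks)
    # Delta_k = 2 * log(L_max / L_k), which is >= 0 and equals 0 for the best model.
    deltas = [2.0 * (max_ll - ll) for ll in log_liks]
    raw = [math.exp(-0.5 * d) for d in deltas]
    total = sum(raw)
    return [r / total for r in raw]

def ensemble_predict(weights, model_predictions):
    """Weighted average of the models' predictions for one new observation."""
    return sum(w * p for w, p in zip(weights, model_predictions))

# Toy, made-up out-of-sample probabilities for K = 3 models and N = 3 cases.
probs = [
    [0.9, 0.8, 0.7],  # model 1: confident and right
    [0.6, 0.5, 0.6],  # model 2: mediocre
    [0.5, 0.5, 0.5],  # model 3: coin flip
]
weights = akaike_style_weights(probs)
y_new = ensemble_predict(weights, [0.8, 0.6, 0.5])
print(weights, y_new)
```

With many test cases the weights tend to concentrate almost entirely on the single best model, which is worth keeping in mind when comparing this to simple averaging.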
I hope the notation doesn't bog you down, but the point is that there are theoretical ways (the Bayesian ones are especially interesting) to determine what probability each model in an ensemble should have and then use that to make predictions, rather than some more heuristic weighting or equal-voting scheme. These theoretically motivated averaging strategies don't generally perform better in empirical studies (or so it seems), but I thought throwing the notion of them into the mix might help you like they helped me.
Best Answer
Your first question: This is probably intentionally formulated very vaguely on Wikipedia. But, as an example, you could think of Bayesian predictions as the weighted average of many models which differ continuously in their parameters. Often the weights are only large in a confined region, so the relevant models in this ensemble differ only by a small change in the parameters, which could qualify as low diversity among the models. A larger diversity would be obtained if you were to fit completely different models, e.g. a decision tree and an SVM.
Your second question: Following your citation of Wikipedia, a combination of strong classifiers would be preferable. But there are many examples where an ensemble of weak classifiers can obtain excellent results, e.g. random forests or gradient boosting machines, provided the ensembles contain sufficiently many models. The performance also depends heavily on how exactly you do the ensembling. Ordinary majority voting is a less sophisticated method, but, given enough models, it might provide good results.
Most of the time the rule of thumb is: the more models, the better, no matter whether they are weak or strong.
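To see why majority voting over many weak models can work, here is a small worked calculation. It makes a strong assumption that real ensembles only approximate: every classifier is correct independently with the same probability `p`. The function name `majority_vote_accuracy` is made up for illustration.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """P(majority of n voters is correct), assuming each voter is
    independently correct with probability p. n must be odd (no ties)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

for n in (1, 11, 101):
    print(n, majority_vote_accuracy(0.6, n), majority_vote_accuracy(0.4, n))
```

Under this assumption, adding voters helps when `p` is above 0.5 and hurts when it is below, so "the more models, the better" relies on each model being at least slightly better than chance and on their errors not being strongly correlated.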