When should we use Bagging or Boosting?

bagging, boosting, classification, outliers

I want to know when Bagging is better than Boosting. How do I select the appropriate method for my classification task?

I think that when the data set contains many outliers, Bagging should be better than Boosting, because these outliers are likely to be misclassified in each iteration, which may cause the model to overfit on them. Is that true?
Do you have any other ideas for choosing between Bagging and Boosting?

Best Answer

  • Boosting in general has a higher risk of overfitting than bagging.
  • Mislabeled cases cause much more serious trouble with boosting than with bagging, similar to the outliers you mention.

  • I'd expect boosting to yield better results than bagging if you can reasonably expect that your submodels don't by themselves put enough weight on cases close to the class boundaries: bagging won't help with this particular aspect, but boosting would up-weight those cases.

  • Ensemble models in general help only in situations where the lack of performance is due to model instability.
    Variance (uncertainty) in the model prediction can have at least two causes: variance in the model itself (instability) and variance in the input data (noisy measurements). An aggregated predictor has improved stability (the "model noise" is averaged out), but if the submodels were already stable (and the noise comes e.g. from the input data), aggregation won't improve the prediction*.

  • As boosting is iteratively refining the model, you need an "outer" independent test, whereas for bagged models you can use the out-of-bag cases for testing.
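
To make the out-of-bag point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy 1-D data, the decision stump used as submodel, and the helper names (`fit_stump`, `predict_stump`, `bagging_oob_error`, `make_toy_data`). Each case is scored only by the submodels whose bootstrap sample did not contain it, which is why bagging does not need a separate test set for this estimate.

```python
import random


def fit_stump(xs, ys):
    """Fit a 1-D decision stump: predict 1 when polarity * (x - threshold) > 0."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            errors = sum(
                int(polarity * (x - threshold) > 0) != y for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, polarity)
    return best[1], best[2]


def predict_stump(stump, x):
    threshold, polarity = stump
    return int(polarity * (x - threshold) > 0)


def bagging_oob_error(xs, ys, n_models=50, seed=0):
    """Bag decision stumps and return the out-of-bag misclassification rate."""
    rng = random.Random(seed)
    n = len(xs)
    oob_votes = [[0, 0] for _ in range(n)]  # votes for class 0 / class 1 per case
    for _ in range(n_models):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of indices
        stump = fit_stump([xs[i] for i in bag], [ys[i] for i in bag])
        in_bag = set(bag)
        for i in range(n):
            if i not in in_bag:  # only submodels that never saw case i may vote
                oob_votes[i][predict_stump(stump, xs[i])] += 1
    # Score only cases that were out of bag at least once (majority vote).
    scored = [(v, y) for v, y in zip(oob_votes, ys) if sum(v) > 0]
    wrong = sum(int(v[1] > v[0]) != y for v, y in scored)
    return wrong / len(scored)


def make_toy_data(n_per_class=20, seed=1):
    """Hypothetical toy data: class 0 in [0, 1], class 1 in [2, 3]."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 1) for _ in range(n_per_class)]
    xs += [rng.uniform(2, 3) for _ in range(n_per_class)]
    ys = [0] * n_per_class + [1] * n_per_class
    return xs, ys


xs, ys = make_toy_data()
print("OOB error estimate:", bagging_oob_error(xs, ys))
```

A boosted model has no equivalent of this: every iteration depends on the weights produced by the previous ones, so cases that influenced the fit cannot be reused as if they were unseen.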

* Depending on the type of classifier and the actual aggregation procedure (e.g. boosting putting more weight on cases close to the class boundaries), the bias can also be influenced, but typically only within the limits of the variance of the submodels: if you think of the point cloud of the submodel predictions and of the aggregated prediction, the aggregated prediction will lie within the cloud of submodel predictions.
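
The instability argument can be checked with a small simulation. This is only a sketch under a strong toy assumption: each submodel prediction is the sum of an input-noise term shared by all submodels and a model-noise term drawn independently per submodel (real bagged submodels are correlated, so the gain would be smaller). The function name `prediction_spread` and all parameter values are made up for illustration.

```python
import random
import statistics


def prediction_spread(n_submodels, model_sd, input_sd, n_trials=2000, seed=0):
    """Return (sd of one submodel's prediction, sd of the averaged prediction).

    Toy assumption: prediction = input noise (shared by all submodels)
    + model noise (drawn independently per submodel).
    """
    rng = random.Random(seed)
    single, aggregated = [], []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, input_sd)  # noisy measurement, same for every submodel
        preds = [shared + rng.gauss(0.0, model_sd) for _ in range(n_submodels)]
        mean_pred = sum(preds) / len(preds)
        # The average can never leave the cloud of submodel predictions
        # (small tolerance for floating-point rounding).
        assert min(preds) - 1e-9 <= mean_pred <= max(preds) + 1e-9
        single.append(preds[0])
        aggregated.append(mean_pred)
    return statistics.pstdev(single), statistics.pstdev(aggregated)


unstable = prediction_spread(n_submodels=25, model_sd=1.0, input_sd=0.1)
stable = prediction_spread(n_submodels=25, model_sd=0.1, input_sd=1.0)
print("unstable submodels: single sd %.3f, aggregated sd %.3f" % unstable)
print("stable submodels:   single sd %.3f, aggregated sd %.3f" % stable)
```

Under these assumptions, averaging shrinks the spread noticeably when the model noise dominates and leaves it almost unchanged when the noise comes from the input data.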