Solved – Adjusting sample weights in AdaBoost

boosting, machine learning

I am trying to read up on AdaBoost from Tibshirani (The Elements of Statistical Learning, page 337 onwards), and would appreciate some help in understanding it better.

The book says that "For each successive iteration m = 2, 3, . . . , M the observation weights are individually modified and the classification algorithm is reapplied to the weighted observations. At step m, those observations that were misclassified by the classifier Gm−1 (x) induced at the previous step have their weights increased, whereas the weights are decreased for those that were classified correctly. Thus as iterations proceed, observations that are difficult to classify correctly receive ever-increasing influence. Each successive classifier is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence."

I am not able to understand what it means to "reapply the algorithm to the weighted observations". Say for example if I am doing 2-class text classification, what am I doing to my observations (document vectors)? How does this "force" the classifier to concentrate on some samples vs others?

Best Answer

mpq has already explained boosting in plain English.

A picture may be worth a thousand words ... (stolen from R. Meir and G. Rätsch, An Introduction to Boosting and Leveraging):

[image: AdaBoost example over successive iterations]

Image remark: In the 1st iteration, the classifier trained on all data points classifies everything correctly except the points in the region x < 0.2 / y > 0.8 and the point around (0.4, 0.55) (see the circles in the second picture). In the second iteration exactly those points get a higher weight, so that the classifier trained on that weighted sample classifies them correctly (2nd iteration, added dashed line). The combined classifier (i.e. "the combination of the dashed lines") is represented by the green line. The second classifier now produces other misclassifications (x in [0.5, 0.6] / y in [0.3, 0.4]), which gain more focus in the third iteration, and so on. At every step the combined classifier gets closer to the best shape (though not at every single step). The final classifier (i.e. the combination of all single classifiers) in the 100th iteration classifies all points correctly.

Now it should be clearer how boosting works. Two questions remain about the algorithmic details.

1. How to estimate misclassifications?

In the standard formulation, each weak classifier is trained on the full training set using the current observation weights, and its error is the weighted misclassification rate over those same weighted observations. In resampling-based implementations, only a weighted sample of the training data is used for training the classifier in each iteration, and the remaining observations can be used to estimate the error / the misclassifications.
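
For reference, these are the quantities involved in the standard AdaBoost.M1 update (Algorithm 10.1 in the book referenced in the question): the weighted error of the m-th classifier, its vote weight, the observation-weight update, and the final combined classifier.

```latex
% AdaBoost.M1: weighted error, classifier weight alpha_m,
% observation-weight update, and final combined classifier
\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i \, I\bigl(y_i \neq G_m(x_i)\bigr)}
                     {\sum_{i=1}^{N} w_i},
\qquad
\alpha_m = \log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m},
\qquad
w_i \leftarrow w_i \cdot \exp\bigl(\alpha_m \, I(y_i \neq G_m(x_i))\bigr),
\qquad
G(x) = \operatorname{sign}\Bigl(\sum_{m=1}^{M} \alpha_m G_m(x)\Bigr)
```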

2. How to apply the weights?

You can do this in the following ways (a minimal sketch of both follows after this list):

  • You sample the data using an appropriate sampling algorithm which can handle weights (e.g. weighted random sampling or rejection sampling) and build the classification model on that sample. The resulting sample contains misclassified examples with a higher probability than correctly classified ones, hence the model learned on that sample is forced to concentrate on the misclassified part of the data space.
  • You use a classification model which is capable of handling such weights implicitly, e.g. decision trees. Decision trees simply count, so instead of using 1 as the increment when an example with certain predictor and class values is presented, they use the specified weight w. If w = 0, the example is practically ignored. As a result, the misclassified examples have more influence on the class probabilities estimated by the model.
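
Here is a minimal sketch of the second approach, assuming scikit-learn decision stumps (which accept a sample_weight argument); the function name adaboost_m1 is just for illustration. The comment at the end shows how the first, resampling-based approach would differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=50):
    """Minimal AdaBoost.M1 sketch; y is expected to be coded as -1/+1."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                 # start with uniform observation weights
    stumps, alphas = [], []

    for m in range(M):
        # Approach 2: the base learner handles the weights directly
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)

        miss = stump.predict(X) != y        # boolean vector of misclassifications
        err = np.sum(w * miss) / np.sum(w)  # weighted training error
        if err <= 0 or err >= 0.5:          # no longer a useful weak learner
            break
        alpha = np.log((1 - err) / err)

        w = w * np.exp(alpha * miss)        # increase weights of misclassified points
        w = w / np.sum(w)                   # renormalize

        stumps.append(stump)
        alphas.append(alpha)

    def predict(X_new):
        # weighted majority vote of all weak classifiers
        votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(votes)

    return predict

# Approach 1 (weighted resampling) would instead draw a sample
#   idx = np.random.choice(N, size=N, replace=True, p=w)
#   stump.fit(X[idx], y[idx])
# so that misclassified examples are over-represented in the training sample.
```

Note how the feature matrix X never changes between iterations; only the weight vector w does.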

Regarding your document example:

Imagine a certain word separates the classes perfectly, but it only appears in a certain part of the data space (i.e. near the decision boundary). Then the word has no power to separate all documents (hence its expressiveness for the whole dataset is low), but only those near the boundary (where its expressiveness is high). Hence the documents containing this word will be misclassified in the first iteration(s) and gain more focus in later iterations. By restricting the data space to the boundary (via the document weights), your classifier will/should detect the expressiveness of the word and classify the examples in that subspace correctly.
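
To make the document case concrete: the document vectors themselves are never modified; only the per-document weights change from iteration to iteration (or, in the resampling variant, which documents end up in the training sample). A toy sketch with made-up documents, assuming scikit-learn's CountVectorizer and AdaBoostClassifier (which boosts decision stumps by default):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import AdaBoostClassifier

# made-up two-class corpus (1 = positive review, 0 = negative review)
docs = [
    "great product fast shipping",
    "terrible quality broke quickly",
    "love it works great",
    "awful product want a refund",
    "fast shipping works as described",
    "broke after a week want a refund",
]
labels = [1, 0, 1, 0, 1, 0]

X = CountVectorizer().fit_transform(docs)   # document vectors: fixed throughout boosting

# every boosting round refits a stump on the same X; only the internal
# observation weights change between rounds
clf = AdaBoostClassifier(n_estimators=25).fit(X, labels)
print(clf.predict(X))
```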

(God help me, I can't think of a more precise example. If the muse later decides to spend some time with me, I will edit my answer).

Note that boosting assumes weak base classifiers. Boosting applied together with Naive Bayes, for example, will have no significant effect (at least in my experience).

edit: added some details on the algorithm and an explanation of the picture.