mpq has already explained boosting in plain English.
A picture may be worth a thousand words ... (stolen from R. Meir and G. Rätsch, An Introduction to Boosting and Leveraging)
Image remark: In the 1st iteration, the classifier trained on all data points classifies everything correctly except the points with x < 0.2 / y > 0.8 and the point around 0.4/0.55 (see the circles in the second picture). In the second iteration, exactly those points gain a higher weight, so that the classifier trained on that weighted sample classifies them correctly (2nd iteration, added dashed line). The combined classifiers (i.e. "the combination of the dashed lines") result in the classifier represented by the green line. The second classifier now produces other misclassifications (x in [0.5, 0.6] / y in [0.3, 0.4]), which gain more focus in the third iteration, and so on. In every step, the combined classifier gets closer to the best shape (although not monotonically). The final classifier (i.e. the combination of all single classifiers) in the 100th iteration classifies all points correctly.
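The reweighting step the picture describes can be sketched in a few lines of Python. The toy data and the threshold classifier below are illustrative, not taken from the figure:

```python
import math

# One AdaBoost round on toy 1-D points, illustrating the reweighting the
# picture shows: after a round, the misclassified points carry more
# weight, so the next classifier is forced to get them right.

X = [0.1, 0.2, 0.35, 0.5, 0.7, 0.9]
y = [+1, +1, -1, -1, +1, +1]

def h(x):                     # weak classifier: +1 left of 0.28, else -1
    return +1 if x < 0.28 else -1

n = len(X)
w = [1.0 / n] * n             # start with uniform weights

# Weighted error of h and its vote weight alpha.
err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)  # 2/6 here
alpha = 0.5 * math.log((1 - err) / err)

# AdaBoost update: multiply by exp(-alpha * y * h(x)) and renormalise.
w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
z = sum(w)
w = [wi / z for wi in w]
# The two misclassified points (x = 0.7 and 0.9) now carry weight 0.25
# each, the four correct ones 0.125 each: the "hard" points together
# make up half the total weight, exactly balancing the easy ones.
```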
Now it should be clearer how boosting works. Two questions remain about the algorithmic details.
1. How are misclassifications estimated?
In every iteration, only a sample of the training data available in that iteration is used for training the classifier; the rest is used for estimating the error, i.e. the misclassifications.
2. How are the weights applied?
You can do this in the following ways:
- You sample the data using an appropriate sampling algorithm that can handle weights (e.g. weighted random sampling or rejection sampling) and build the classification model on that sample. The resulting sample contains misclassified examples with a higher probability than correctly classified ones, so the model learned on that sample is forced to concentrate on the misclassified part of the data space.
- You use a classification model that can handle such weights implicitly, e.g. decision trees. Decision trees simply count: instead of incrementing a counter by 1 when an example with certain predictor and class values is presented, they increment it by the specified weight w. If w = 0, the example is effectively ignored. As a result, the misclassified examples have more influence on the class probabilities estimated by the model.
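The first option (weighted sampling) can be sketched with the standard library's `random.choices`, which accepts weights directly. The helper name and toy data are made up for illustration:

```python
import random
from collections import Counter

def weighted_resample(examples, weights, k, seed=0):
    """Draw a sample of size k where each example's chance of being
    picked is proportional to its boosting weight.  Misclassified
    examples (higher weight) therefore appear more often in the sample,
    so the next base classifier is forced to focus on them."""
    rng = random.Random(seed)
    return rng.choices(examples, weights=weights, k=k)

# Hypothetical toy data: examples 'a'..'d', where 'c' was misclassified
# in the last round and therefore carries a much larger weight.
examples = ["a", "b", "c", "d"]
weights  = [0.1, 0.1, 0.6, 0.2]

sample = weighted_resample(examples, weights, k=1000)
counts = Counter(sample)
# 'c' dominates the resampled training set.
```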
Regarding your document example:
Imagine a certain word separates the classes perfectly, but it only appears in a certain part of the data space (i.e. near the decision boundary). Then the word has no power to separate all documents (hence its expressiveness for the whole dataset is low), but only those near the boundary (where its expressiveness is high). Hence the documents containing this word will be misclassified in the first iteration(s) and thus gain more focus in later iterations. By restricting the data space to the boundary (through document weighting), your classifier will/should detect the expressiveness of the word and classify the examples in that subspace correctly.
(God help me, I can't think of a more precise example. If the muse later decides to spend some time with me, I will edit my answer).
Note that boosting assumes weak classifiers. For example, boosting applied together with Naive Bayes has no significant effect (at least in my experience).
edit: Added some details on the algorithm and explanation for the picture.
I think I've found an explanation for this question. We can expand the function:
$\ell(h(x),y)=(h(x)-y\exp(-yf(x)))^2=h^2(x)+(y\exp(-yf(x)))^2-2h(x)y\exp(-yf(x))$.
Since we focus on a classification problem here, we have $y\in\{-1,+1\}$ and $h(x)\in\{-1,+1\}$, so $h^2(x)=1$ and $(y\exp(-yf(x)))^2=\exp(-2yf(x))$; neither term depends on $h$. Therefore, only the last term $-2h(x)y\exp(-yf(x))$ matters when minimising over $h$. Let $w=\exp(-yf(x))$; then we can see that Eq. 1 and Eq. 2 are equivalent.
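Spelled out step by step, the expansion reads:

```latex
\begin{align*}
\ell(h(x),y) &= \bigl(h(x)-y\,e^{-yf(x)}\bigr)^2 \\
             &= h^2(x) + y^2 e^{-2yf(x)} - 2\,h(x)\,y\,e^{-yf(x)} \\
             &= 1 + e^{-2yf(x)} - 2\,h(x)\,y\,e^{-yf(x)},
\end{align*}
```

using $h^2(x)=y^2=1$. The first two terms are constant with respect to $h$, so minimising $\ell$ over $h$ is the same as maximising $w\,y\,h(x)$ with $w=e^{-yf(x)}$.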
Best Answer
The main difference between AdaBoost and other "generic" boosting algorithms is that AdaBoost uses the (deviance) residuals as weights, while "generic" gradient boosting algorithms use the residuals as the learning target itself.
This difference between AdaBoost and other "generic" Gradient Boosting Machine (GBM) methodologies is more prominent when we examine a "generic" GBM as an additive model where we find the solution iteratively via the Backfitting algorithm (see Elements of Statistical Learning, Hastie et al. (2009), Ch. 10.2 "Boosting Fits an Additive Model" for the relation between boosting and additive models in more detail). To that extent you can look into LogitBoost, as it is effectively the one that bridges AdaBoost and the "generic" GBM framework. The "LogitBoost paper" (Additive logistic regression: a statistical view of boosting, Friedman et al. (2000)) has a specific section (Sect. 4, AdaBoost: an additive logistic regression model) that focuses on the interpretation of AdaBoost as a stage-wise estimation procedure for fitting an additive logistic regression model. It focuses on how AdaBoost minimises the expectation of the exponential loss $E\{ e^{-y F(x)} \}$ via an iterative approach.
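The contrast between the two update styles can be shown on a toy example. This is a rough sketch (the labels and scores below are made up); it only illustrates where the residual information goes in each framework:

```python
import math

# AdaBoost: residual information enters through *example weights*.
# Generic GBM: residuals become the *regression target* of the next learner.

y = [+1, -1, +1, +1]          # true labels
f = [0.5, 0.3, -0.2, 1.0]     # current ensemble scores F(x_i)

# AdaBoost-style: weight w_i = exp(-y_i F(x_i)); misclassified points
# (those with y_i F(x_i) < 0) get the largest weights.
ada_weights = [math.exp(-yi * fi) for yi, fi in zip(y, f)]

# GBM-style (squared loss for illustration): the next base learner is
# trained to predict the residual r_i = y_i - F(x_i) directly.
gbm_targets = [yi - fi for yi, fi in zip(y, f)]
```

Note how in the AdaBoost branch the next learner still sees the original labels `y`, just with different weights, whereas in the GBM branch the labels are replaced by the residuals themselves.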