Well, first of all: in the presentation you mentioned, they use a test of whether the value of one feature is larger/smaller than some threshold (i.e. a decision tree of depth 1) as a partial classifier, hence the feature/classifier ambiguity.
Going back to the question: there are numerous ways to get a feature ranking from a boosting model, ranging from counting how deep in the boost structure a feature lies, up to permutation tests. In the work you quoted there is in fact no feature selection: during training they simply select the best feature/threshold pair, and during prediction they compute features "on demand" as the prediction goes through the boost, to save computation time.
There are some additional details which do not fit into a comment, so I will make an answer from them.
The initial version of AdaBoost was designed for binary classification. The only requirement for the boosting procedure to progress is that each new weak model has an error less than 0.5. The original AdaBoost algorithm can be applied to classification with more than 2 classes, but the constraint remains (error less than 0.5), and such classifiers can no longer be called weak.
One of the simplest and most effective adaptations of AdaBoost to multiple classes is called AdaBoost.SAMME. This algorithm is similar to the original, the only difference being that an additional term $\log(K-1)$ is added in the computation of $\alpha^{(m)}$. This relaxes the original error requirement from $0.5$ to $(K-1)/K$, where $K$ is the number of classes. This is one improvement in case you use AdaBoost for classification with $K>2$ labels.
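A minimal sketch of the SAMME learner weight, following the formula above (my own illustrative code, not from the paper):

```python
import numpy as np

def samme_alpha(err, K):
    """Weight alpha of a weak learner under AdaBoost.SAMME.

    err : weighted training error of the weak learner
    K   : number of classes
    The extra log(K - 1) term keeps alpha positive as long as
    err < (K - 1) / K, i.e. the learner beats random guessing.
    """
    return np.log((1.0 - err) / err) + np.log(K - 1)

# For K = 2 the extra term vanishes and we recover the original
# binary AdaBoost weight, where err < 0.5 is required.
```

Note that with $K=3$ classes a learner with error 0.5 (useless in binary AdaBoost) still gets a positive weight, since $0.5 < (K-1)/K = 2/3$.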
The second improvement I have often seen, and have implemented myself, is to use bootstrap samples to train each weak classifier instance. This can accelerate learning, but it raises the question of what the algorithm should do when $\alpha^{(m)} \leq 0$. The problem arises from the fact that the weak learner may have a proper error on the bootstrap sample but not on the original sample, which complicates the decision the algorithm should take.
Finally, about the meaning of a negative alpha and the decisions based on it.
If $\alpha^{(m)}$ is $0$, then the algorithm has learned nothing new. If it is negative, it might mean the learner will do damage if added (if you do not use bootstrapping; with bootstrapping you cannot be sure). So, when such a situation is encountered, the algorithm must make a decision. From what I have seen in implementations, the following decisions are used here and there:
- stop the learning and exit the loop; this decision is made particularly if you want to save some memory and keep the model short
- do not add the learner to the ensemble of weak learners and continue the procedure, trying to build another learner; this makes sense when the weak model has some random component and does not always give the same results, or when you use bootstrap samples, which induce some variance by themselves
- add the learner to the ensemble and proceed normally; I have not seen this anywhere else, but I implemented it in my own library and it gives surprisingly good results (my intuition right now is that it behaves like a random jump out of a local maximum in the search for a global maximum)
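The three options above could be condensed into a small policy function; this is my own sketch (names and structure are mine, not from any reference implementation):

```python
def handle_weak_learner(alpha, policy):
    """Decide what to do with a freshly trained weak learner.

    policy is one of the three strategies described above:
      'stop' -- end boosting, keep the model short
      'skip' -- drop this learner, try training another one
      'add'  -- keep it anyway (the 'random jump' heuristic)

    Returns 'keep', 'discard' or 'halt' for the boosting loop.
    """
    if alpha > 0:
        return 'keep'          # normal case: the learner is useful
    if policy == 'stop':
        return 'halt'          # exit the boosting loop
    if policy == 'skip':
        return 'discard'       # retrain with a new bootstrap sample
    return 'keep'              # policy == 'add': accept alpha <= 0
```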
[Later edit: additional info]
The paper which describes the algorithm is Hastie, T., Rosset, S., Zhu, J., & Zou, H. (2009). Multi-class AdaBoost. Statistics and Its Interface, 2(3), 349–360. doi:10.4310/SII.2009.v2.n3.a8. There you can find the version of multi-class AdaBoost called AdaBoost.SAMME. The single additional change is the term $\log(K-1)$, as I described in the previous part. I do not claim that this is the only version of multi-class AdaBoost; I knew about the version described in the link you gave. I simply prefer this one because it is easy to explain and easy to adapt. Besides that, you can note that the scikit-learn library (a popular machine-learning library written in Python) takes the same approach; see the link to the documentation for scikit-learn 0.14.
Regarding bootstrapping for boosting, I found information about it in ESL, the well-known book (The Elements of Statistical Learning). I own the second edition, but it can be downloaded freely from the Internet. Pages 364-365 describe two features which can be added in order to enforce some regularization. The first one is shrinkage, the second one is sub-sampling. These are described for gradient boosting, but sub-sampling can be generalized to any boosting procedure. As a consequence I use it in my own code, and I noticed that Weka uses it too (though I did not check whether anybody else has implemented it).
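The two regularizers can be illustrated in a toy gradient-boosting loop for 1-D regression with decision stumps (my own illustrative code, not from the book; the data and all names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(x, r):
    """Best depth-1 regression tree (stump) on residuals r."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((r - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

# Toy 1-D regression data.
x = np.linspace(0, 1, 50)
y = np.sin(4 * x)

shrinkage, rounds = 0.1, 200
pred = np.zeros_like(y)
for _ in range(rounds):
    # Sub-sampling: each stump sees only a random half of the data.
    idx = rng.choice(len(x), size=len(x) // 2, replace=False)
    r = y - pred                  # current residuals
    stump = fit_stump(x[idx], r[idx])
    pred += shrinkage * stump(x)  # shrinkage: damped update
```

After 200 rounds the training error is tiny; dropping either `shrinkage` or the sub-sampling line gives the plain gradient-boosting loop.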
Regarding the decision when alpha is negative: I stated that continuing when alpha is negative, especially when you use sub-sampling (bootstraps), makes some sense to me and gives good results, but it is my own idea. I do not exclude that somebody else has put that option in a paper; I have no knowledge of it and found it through practice.
PS: If I am wrong, I would really like to see where, because my main purpose here, and I believe everybody's, is to learn. So I would be really glad to learn further.
Best Answer
mpq has already explained boosting in plain English.
A picture may be worth a thousand words ... (stolen from R. Meir and G. Rätsch. An introduction to boosting and leveraging)
Image remark: In the 1st iteration, the classifier trained on all data points classifies everything correctly except the points in x < 0.2 / y > 0.8 and the point around 0.4/0.55 (see the circles in the second picture). In the second iteration, exactly those points gain a higher weight, so the classifier trained on that weighted sample classifies them correctly (2nd iteration, added dashed line). The combined classifiers (i.e. "the combination of the dashed lines") result in the classifier represented by the green line. The second classifier then produces other misclassifications (x in [0.5, 0.6] / y in [0.3, 0.4]), which gain more focus in the third iteration, and so on. In every step, the combined classifier gets closer to the best shape (although not monotonically). The final classifier (i.e. the combination of all single classifiers) in the 100th iteration classifies all points correctly.
Now it should be clearer how boosting works. Two questions remain about the algorithmic details.
1. How to estimate misclassifications?
In every iteration, only a sample of the training data available in that iteration is used for training the classifier; the rest is used for estimating the error / the misclassifications.
2. How to apply the weights?
You can do this in one of two ways:
- Resampling: draw a new training set in which each example is selected with probability proportional to its weight, and train the weak learner on that (unweighted) resample.
- Weighted loss: pass the weights directly to a weak learner that can minimize a weighted error.
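The two standard options, resampling according to the weights or training with a weighted loss, can be sketched as follows (toy data and names are my own; `sample_weight` is the usual keyword scikit-learn estimators accept, mentioned only as an aside):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # toy feature matrix
w = rng.random(100)
w /= w.sum()                    # example weights, normalised to sum to 1

# Option 1: resample the training set according to the weights, then
# train the weak learner on the resampled (unweighted) data.
idx = rng.choice(len(X), size=len(X), replace=True, p=w)
X_resampled = X[idx]

# Option 2: pass the weights directly to a learner supporting a
# weighted loss, e.g. (scikit-learn style, shown only as a comment):
#     learner.fit(X, y, sample_weight=w)
```

Heavily weighted (previously misclassified) points appear many times in the resample, so either option makes the next weak learner focus on them.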
Regarding your document example:
Imagine a certain word separates the classes perfectly, but it only appears in a certain part of the data space (i.e. near the decision boundary). Then the word has no power to separate all documents (hence its expressiveness for the whole dataset is low), only those near the boundary (where its expressiveness is high). Hence the documents containing this word will be misclassified in the first iteration(s), and will therefore gain more focus later. By restricting the data space to the boundary (through document weighting), your classifier will/should detect the expressiveness of the word and classify the examples in that subspace correctly.
(God help me, I can't think of a more precise example. If the muse later decides to spend some time with me, I will edit my answer).
Note that boosting assumes weak classifiers. For example, boosting applied together with Naive Bayes will have no significant effect (at least in my experience).
edit: Added some details on the algorithm and explanation for the picture.