Solved – Values of the weights in Adaboost

boosting, feature selection

I have implemented a simple AdaBoost algorithm using several weak classifiers, and when I check the alpha values it computes, some of them are negative. Is that possible, or is there a bug in my implementation?

Thank you.

Best Answer

There are some additional details which do not fit into a comment, so I will write them up as an answer.

The initial version of AdaBoost was designed for binary classification. The only requirement for the boosting procedure to make progress is that each new weak model has an error of less than 0.5. The original AdaBoost algorithm can be applied to classification with more than 2 classes, but the constraint remains (error less than 0.5), and then the classifiers can no longer really be called weak.

One of the simplest and most effective adaptations of AdaBoost to multiple classes is called AdaBoost.SAMME. This algorithm is similar to the original, the only difference being that an additional term $\log(K-1)$ is added in the computation of $\alpha^{(m)}$. This relaxes the original requirement from an error below $0.5$ to an error below $(K-1)/K$, where $K$ is the number of classes. This is one improvement to consider if you use AdaBoost for classification with $K>2$ labels.
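
As a minimal sketch (the function names are mine, and I follow the SAMME paper's convention for the weak-learner weight), the only change compared with the two-class formula is the extra $\log(K-1)$ term:

```python
import numpy as np

def alpha_adaboost(err):
    """Weak-learner weight in the original two-class AdaBoost
    (positive only while err < 0.5)."""
    return np.log((1.0 - err) / err)

def alpha_samme(err, K):
    """Weak-learner weight in AdaBoost.SAMME for K classes
    (positive while err < (K - 1) / K, i.e. better than random guessing)."""
    return np.log((1.0 - err) / err) + np.log(K - 1)
```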

The second improvement I have often seen, and have implemented myself, is to use bootstrap samples to train each weak classifier. This can speed up learning, but it poses some problems around the decision the algorithm should take when $\alpha^{(m)} \leq 0$. The problem arises because the weak learner may have an acceptable error on the bootstrap sample but not on the original sample. This complicates the decision the algorithm should take.
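
A rough sketch of what I mean, assuming `X`, `y` are numpy arrays, `w` the current boosting weights and `rng` a numpy random generator (the function name and the decision-stump base learner are my own choices): the weak learner is fit on a bootstrap sample, but the error that feeds into alpha is still measured on the full weighted training set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_weak_on_bootstrap(X, y, w, rng):
    """Train one weak learner on a bootstrap sample drawn with the current
    boosting weights, then evaluate its weighted error on the original sample."""
    n = len(y)
    idx = rng.choice(n, size=n, replace=True, p=w / w.sum())  # weighted bootstrap
    learner = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])

    # The error that determines alpha is computed on the full weighted data,
    # not on the bootstrap sample, which is why alpha can turn out <= 0.
    miss = learner.predict(X) != y
    err = np.sum(w * miss) / np.sum(w)
    return learner, err
```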

Finally, about the meaning of a negative alpha and the decisions based on it.

If $\alpha^{(m)}$ is $0$, then the algorithm has learned nothing new. If it is negative, then it might mean that the learner will do damage if added, at least when you do not use bootstrapping; if you do, you cannot be sure. So, when such a situation is encountered, the algorithm must make a decision. From what I have seen in implementations, the following decisions are used here and there:

  • stop the learning and exit the loop; this decision is made particularly if you want to save some memory and keep the model short
  • do not add the learner to the boosted ensemble and continue the procedure, trying to build another learner; this makes sense when the weak model involves some randomness and does not always give the same results, or when you use bootstrap samples, which induce some variance by themselves
  • add the learner to the ensemble and proceed normally; I have not seen this anywhere else, but I implemented it in my own library and it gives surprisingly good results; my current intuition is that it behaves like a random jump out of a local maximum in the search for the global maximum (a sketch of all three options follows this list)
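
To make the three options concrete, here is an illustrative sketch of how they could be handled inside the boosting loop (the function name and the `policy` labels are just my own names for the three cases above):

```python
def handle_nonpositive_alpha(alpha, policy):
    """Decide what to do with a weak learner whose alpha is <= 0.

    policy: 'stop' -> end boosting and keep the model short
            'skip' -> drop this learner and try to build another one
            'keep' -> add it anyway and proceed normally
    Returns (add_learner, continue_boosting).
    """
    if alpha > 0:
        return True, True
    if policy == 'stop':
        return False, False
    if policy == 'skip':
        return False, True
    return True, True  # 'keep': accept the learner despite alpha <= 0

# Inside the boosting loop one could then write, for example:
#   add, go_on = handle_nonpositive_alpha(alpha, policy='skip')
#   if add:
#       ensemble.append((alpha, learner))  # and update the sample weights
#   if not go_on:
#       break
```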

[Later edit: additional info]

The paper which describes the algorithm is Hastie, T., Rosset, S., Zhu, J., & Zou, H. (2009). Multi-class AdaBoost. Statistics and Its Interface, 2(3), 349–360. doi:10.4310/SII.2009.v2.n3.a8. There you can find the version of multi-class AdaBoost called AdaBoost.SAMME. The single additional change is the $\log(K-1)$ term, as I described above. I do not claim this is the only version of multi-class AdaBoost; I knew about the version described in the link you mentioned. I simply prefer this one because it is easy to explain and easy to adapt the algorithm to. Besides that, note that the scikit-learn library (a popular machine-learning library written in Python) takes the same approach; see the linked documentation for scikit-learn 0.14.
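
For completeness, this is roughly how the SAMME variant is selected in scikit-learn (the parameter values are placeholders, and `X_train`, `y_train` are assumed to exist):

```python
from sklearn.ensemble import AdaBoostClassifier

# algorithm="SAMME" selects the multi-class variant with the log(K-1) term;
# the default weak learner is a depth-1 decision tree (a stump).
clf = AdaBoostClassifier(n_estimators=100, algorithm="SAMME")
# clf.fit(X_train, y_train)
```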

Regarding bootstrapping for boosting, I found information about it in The Elements of Statistical Learning (ESL), the well-known book. I own the second edition, but it can be downloaded freely from the Internet. Pages 364-365 describe two features which can be added in order to enforce some regularization. The first one is shrinkage, the second one is sub-sampling. These are described for gradient boosting, but sub-sampling can be generalized to any boosting procedure. As a consequence I use it in my own code, and I noticed that Weka uses it as well (though I did not check whether anybody else has implemented it).
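
As an aside, both regularization features described on those pages are exposed directly in scikit-learn's gradient boosting; a minimal illustration with arbitrary parameter values:

```python
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,  # shrinkage
    subsample=0.5,      # sub-sampling: each stage sees a random half of the data
)
# gbc.fit(X_train, y_train)   # X_train, y_train are placeholders
```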

Regarding the decision when alpha is negative: I stated that continuing when alpha is negative, especially when you use sub-sampling (bootstraps), makes some sense to me and gives good results, but it is my own idea. I do not exclude that somebody else may have described this option in a paper; I have no knowledge of that and arrived at it from practice.

PS: If I am wrong I would really like to see how, because my main purpose, and I believe everybody's around here, is to learn. So I would be really glad to learn further.
