Solved – Derivation of AdaBoost.R2 algorithm

adaboost, regression

I am having difficulty understanding the derivation of the AdaBoost.R2 algorithm (AdaBoost for regression problems), as presented in this paper by Drucker (page 2), which seems to be the source people cite when referring to it. Notably, the implementation in sklearn (the Python library) is based on this paper. Drucker says it is a modification of the AdaBoost.R algorithm proposed by Freund and Schapire here (page 136).
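For reference, a minimal call to that sklearn implementation looks something like this (the toy data and parameter values below are placeholders I chose for illustration, not anything from the paper):

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Toy 1-D regression problem (placeholder data).
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# loss='linear' is one of the three loss functions Drucker describes
# (the others are 'square' and 'exponential').
model = AdaBoostRegressor(n_estimators=100, loss="linear", random_state=0)
model.fit(X, y)
print(model.predict(X[:5]))
```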

I have two questions:

  1. Why is the final prediction of AdaBoost.R2 the weighted median of all the weak learners' predictions? This is in contrast to AdaBoost for classification problems, which takes a weighted majority vote of the weak learners.
  2. It is my understanding (from section 10.4 in Hastie's Elements of Statistical Learning) that AdaBoost for classification can be understood as gradient boosting (Hastie, algorithm 10.2) with the exponential loss function $L=\exp(-y \,f(x))$, where $y$ is the target class and $f(x)$ is our prediction. It is a straightforward exercise to show that the AdaBoost.M1 algorithm (algorithm 10.1 in Hastie) follows from this approach. AdaBoost for regression does not use the exponential loss function, but can it, too, be derived in a similar manner?
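For concreteness, the step I have in mind, following section 10.4 of Hastie, is

$$(\gamma_m, G_m) = \arg\min_{\gamma, G} \sum_{i=1}^n e^{-y_i\left(f_{m-1}(x_i) + \gamma G(x_i)\right)} = \arg\min_{\gamma, G} \sum_{i=1}^n w_i^{(m)} e^{-\gamma\, y_i G(x_i)}, \qquad w_i^{(m)} = e^{-y_i f_{m-1}(x_i)},$$

and since $y_i, G(x_i) \in \{-1, 1\}$, the sum separates into correctly and incorrectly classified points, giving $G_m$ as the classifier minimizing the weighted error rate $\overline{\mathrm{err}}_m$ and $\gamma_m = \frac{1}{2}\log\frac{1-\overline{\mathrm{err}}_m}{\overline{\mathrm{err}}_m}$, i.e. AdaBoost.M1 up to a factor of 2 in the coefficients. I do not see an analogous argument for the regression losses.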

Finally, in general, if there are any other sources that go over AdaBoost's implementations for regression problems (not necessarily just for regression trees), I would appreciate any links.

Thank you!

Best Answer

  1. The choice to use the weighted median appears to be arbitrary. According to this paper:

    "the idea of using the weighted median as the final regressor is not new. Freund [6] briefly mentions it and proves a special case of the main theorem of this paper. The ADABOOST.R algorithm of Freund and Schapire [7] returns the weighted median but the response space is restricted to [0;1] and the parameter updating steps are rather complicated. Drucker [4] also uses the weighted median of the base regressors as the final regressor but the parameter updates are heuristic and the convergence of the method is not analyzed."
    That crazy formula with the logs in Drucker's paper is evidently only one way to compute the weighted median; IBM gives a different formula. The sketch at the end of this answer computes it directly, by sorting the predictions and accumulating the learner weights.

    It is possible the researchers liked this aggregation method because they thought it was more robust to noise, but I am not sure whether their reasons were well-founded. The author of the first paper seems to favor median-based approaches because they generalize AdaBoost, not because they are better by some metric.

  2. Yes, in the classification case it is easy to recognize that the exponential in $(\gamma_m, G_m) = \arg\min_{\gamma,G} \sum_{i=1}^n e^{-y_i(f_{m-1}(x_i) + \gamma G(x_i))}$ can be factored to give $w_i = e^{-y_i f_{m-1}(x_i)}$ (the exponential loss of the model so far), which blows up as $y$ and $f(x)$ diverge and famously behaves badly when outliers or mislabeled data are present.

    But in AdaBoost.R the weights are updated according to complicated formulas, and in AdaBoost.R2 the update is the multiplier $\beta^{1-L_i} = \left(\frac{\bar{L}}{1-\bar{L}}\right)^{1-L_i}$, which has a funky shape (qualitatively, since $\bar{L}$ and $L_i$ are related but not exactly the same thing): it treats the examples with the greatest loss as somewhat less important than the ones with slightly less error, which probably helps make the method more resilient to outliers.

    These update rules do not just fall out of the loss function; they are constructed on top of it. All a loss function is supposed to do is answer the question "How far away am I?" What you do once you know that does not need to be related to how you measure it. It can be a heuristic, and you can even engineer in features like noise tolerance. It would be nice if Drucker had proved his choice is a good one, but what matters to most people is that it seems to work in practice.
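To make both points concrete, here is a minimal sketch of AdaBoost.R2 with the linear loss, written from Drucker's description. The depth-3 trees, the bootstrap resampling step, and all function names are my own choices, not part of the paper; the point is to show the $\beta^{1-L_i}$ reweighting and the weighted-median prediction with learner weights $\log(1/\beta_t)$.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def adaboost_r2_fit(X, y, n_estimators=50, rng=None):
    """Minimal AdaBoost.R2 sketch using the linear loss."""
    rng = np.random.RandomState(0) if rng is None else rng
    n = len(y)
    w = np.full(n, 1.0 / n)                   # example weights, kept normalized
    learners, betas = [], []
    for _ in range(n_estimators):
        # Draw a bootstrap sample according to the current weights, fit a weak regressor.
        idx = rng.choice(n, size=n, replace=True, p=w)
        tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
        abs_err = np.abs(tree.predict(X) - y)
        D = abs_err.max()
        if D == 0:                            # perfect fit on every example
            learners.append(tree); betas.append(1e-10)
            break
        L = abs_err / D                       # linear loss, each L_i in [0, 1]
        L_bar = np.sum(w * L)                 # weighted average loss
        if L_bar >= 0.5:                      # stopping condition from the paper
            break
        beta = L_bar / (1.0 - L_bar)          # beta < 1 because L_bar < 0.5
        # Low-loss examples are multiplied by ~beta, high-loss examples by ~1,
        # so the next round concentrates on the examples this learner got wrong.
        w = w * beta ** (1.0 - L)
        w /= w.sum()
        learners.append(tree)
        betas.append(beta)
    return learners, np.array(betas)

def adaboost_r2_predict(learners, betas, X):
    """Weighted median of the weak learners' predictions, weights log(1/beta_t)."""
    preds = np.array([t.predict(X) for t in learners])    # shape (T, n_samples)
    alpha = np.log(1.0 / betas)                           # learner weights
    order = np.argsort(preds, axis=0)                     # sort learners per sample
    sorted_preds = np.take_along_axis(preds, order, axis=0)
    cum = np.cumsum(alpha[order], axis=0)                 # accumulate weights in sorted order
    # First sorted prediction whose cumulative weight reaches half the total weight.
    median_idx = np.argmax(cum >= 0.5 * alpha.sum(), axis=0)
    return sorted_preds[median_idx, np.arange(preds.shape[1])]
```

Usage would be something like `learners, betas = adaboost_r2_fit(X, y)` followed by `adaboost_r2_predict(learners, betas, X_new)` on numpy arrays. Note that a learner with a small $\beta_t$ (small average loss) gets a large say in where the weighted median lands, which plays the same role as $\alpha_m$ in the classification algorithm.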
