Solved – The relationship between AdaBoost and gradient boosting

boosting, machine learning

I am reading Chapter 10 of "The Elements of Statistical Learning, 2nd ed. (ESLII)", where the AdaBoost algorithm is explained as minimizing the exponential loss with a stagewise additive modelling approach. That is, at the $m$th iteration, we fit a classifier $h$ with $h(x)\in\{-1,+1\}$ such that the following weighted classification error is minimized:

$h_m=\arg\min_h\sum_{i=1}^N w_i^m I(y_i\neq h(x_i))$, (1)

where $I(\cdot)$ is the indicator function and $w_i^m=\exp(-y_if_{m-1}(x_i))$.
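For concreteness, here is a minimal numerical sketch of what Eq. 1 computes; the toy data and the decision-stump candidate are my own illustration, not from ESLII:

```python
import numpy as np

# Toy data with labels in {-1, +1} (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = np.where(x + 0.3 * rng.normal(size=20) > 0, 1, -1)

# Stand-in for the current ensemble f_{m-1}(x).
f_prev = 0.5 * np.where(x > 0.1, 1, -1)

# AdaBoost weights w_i^m = exp(-y_i f_{m-1}(x_i)).
w = np.exp(-y * f_prev)

def weighted_error(h_pred, y, w):
    """Eq. 1: sum_i w_i^m * I(y_i != h(x_i))."""
    return np.sum(w * (h_pred != y))

# Evaluate one candidate decision stump h(x) = sign(x - 0).
h_pred = np.where(x > 0.0, 1, -1)
print(weighted_error(h_pred, y, w))
```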

I am wondering whether the AdaBoost algorithm can be explained within the framework of gradient boosting, i.e., by repeatedly fitting the negative gradient of the exponential loss. At first I thought this was obvious, but when I derive the algorithm it does not seem to be the case.

Given the exponential loss $\ell(f(x),y)=\exp(-yf(x))$, the negative gradient with respect to $f(x)$ is $r=y\exp(-yf(x))$. Therefore, at the $m$th iteration, we need to perform the least-squares fit:
$h_m=\arg\min_h\sum_{i=1}^N(h(x_i)-y_i\exp(-y_if(x_i)))^2|_{f=f_{m-1}}$ (2).
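Under the gradient-boosting view, this step would look something like the sketch below (again a toy illustration; the threshold-stump family is an assumption on my part):

```python
import numpy as np

# Same kind of toy setup as above (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = np.where(x + 0.3 * rng.normal(size=20) > 0, 1, -1)
f_prev = 0.5 * np.where(x > 0.1, 1, -1)

# Negative gradient of the exponential loss at f_{m-1}:
# r_i = y_i * exp(-y_i f_{m-1}(x_i)).
r = y * np.exp(-y * f_prev)

def ls_objective(h_pred, r):
    """Eq. 2: sum_i (h(x_i) - r_i)^2."""
    return np.sum((h_pred - r) ** 2)

# Pick the threshold stump h(x) = sign(x - t) with the smallest Eq. 2 value.
thresholds = np.linspace(x.min(), x.max(), 41)
best_t = min(thresholds, key=lambda t: ls_objective(np.where(x > t, 1, -1), r))
print(best_t, ls_objective(np.where(x > best_t, 1, -1), r))
```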

I cannot verify the equivalence between Eq. 1 and Eq. 2. Actually, I doubt whether these two problems are the same.

Best Answer

I think I've found an explanation for this question. We can expand the squared term:

$\ell(h(x),y)=(h(x)-y\exp(-yf(x)))^2=h^2(x)+(y\exp(-yf(x)))^2-2h(x)y\exp(-yf(x))$.

Since we focus on a classification problem here, we have $y\in\{-1,+1\}$ and $h(x)\in\{-1,+1\}$, so $h^2(x)=1$, and the second term does not depend on $h$. Therefore, only the last term $-2h(x)y\exp(-yf(x))$ matters: with $w_i^m=\exp(-y_if_{m-1}(x_i))$, minimizing Eq. 2 amounts to maximizing $\sum_{i=1}^N w_i^m y_i h(x_i)$. Because $y_ih(x_i)=1-2I(y_i\neq h(x_i))$, this sum equals $\sum_i w_i^m - 2\sum_i w_i^m I(y_i\neq h(x_i))$, so maximizing it is the same as minimizing the weighted error in Eq. 1. Hence Eq. 1 and Eq. 2 are equivalent.
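This can also be checked numerically. The sketch below (toy data and threshold stumps are my own assumptions) confirms that, over $\{-1,+1\}$-valued classifiers, Eq. 2 equals a constant plus four times Eq. 1, so both objectives share the same minimizer:

```python
import numpy as np

# Toy data (illustrative only); labels and candidate classifiers in {-1, +1}.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = np.where(x + 0.5 * rng.normal(size=50) > 0, 1, -1)
f_prev = 0.3 * np.where(x > 0, 1, -1)      # stand-in for f_{m-1}(x)

w = np.exp(-y * f_prev)                    # weights in Eq. 1
r = y * w                                  # negative gradient in Eq. 2

def eq1(h):
    return np.sum(w * (h != y))            # weighted classification error

def eq2(h):
    return np.sum((h - r) ** 2)            # least-squares fit to r

stumps = [np.where(x > t, 1, -1) for t in np.linspace(-2, 2, 41)]
vals1 = np.array([eq1(h) for h in stumps])
vals2 = np.array([eq2(h) for h in stumps])

# Expanding Eq. 2 with h^2 = 1 gives eq2 = N + sum(r^2) - 2*sum(w) + 4*eq1,
# so both objectives are minimized by the same stump.
const = len(x) + np.sum(r ** 2) - 2 * np.sum(w)
print(np.allclose(vals2, const + 4 * vals1))   # expected: True
```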
