A few problems understanding Adaboost

machine learning

I am studying the following Adaboost algorithm:

$\text{for } t=1,2,\ldots,T:$

$h_t = \arg\min_{h\in H} \sum_{i=1}^m D_i^t\cdot 1_{[h(x_i)\neq y_i]}$

$\epsilon_t = \sum_{i=1}^m D_i^t\cdot 1_{[h_t(x_i)\neq y_i]}$

$w_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$

$\forall i \in [m]: \hat D^{t+1}_i = D_i^t\cdot \exp(-y_i\cdot w_t\cdot h_t(x_i))$

$D_i^{t+1}=\frac{\hat D_i^{t+1}}{\sum_{j=1}^m \hat D_j^{t+1}}$

and afterwards our prediction will be:

$h_s(x)=\operatorname{sign}\left(\sum_{t=1}^T w_t\cdot h_t(x)\right)$
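For concreteness, here is a minimal NumPy sketch of how I read the loop above. The assumption that labels are in $\{-1,+1\}$ and that $H$ is a finite list of callables, as well as all function names, are mine.

```python
# A minimal sketch of the boosting loop above (my own reading of it).
# Assumptions: labels y are in {-1, +1}, and the hypothesis class H is a finite
# list of callables, each mapping an array of samples X to predictions in {-1, +1}.
import numpy as np

def adaboost(X, y, H, T):
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D^1 is uniform on the training set
    hypotheses, weights = [], []
    for t in range(T):
        # h_t = argmin over H of the D^t-weighted training error
        errors = np.array([np.sum(D * (h(X) != y)) for h in H])
        best = int(np.argmin(errors))
        h_t, eps_t = H[best], errors[best]
        # w_t = 1/2 ln((1 - eps_t) / eps_t); assumes 0 < eps_t < 1/2
        w_t = 0.5 * np.log((1.0 - eps_t) / eps_t)
        hypotheses.append(h_t)
        weights.append(w_t)
        # D^{t+1} is proportional to D^t * exp(-y * w_t * h_t(x)), then renormalized
        D = D * np.exp(-y * w_t * h_t(X))
        D /= D.sum()
    return hypotheses, np.array(weights)

def predict(X, hypotheses, weights):
    # h_s(x) = sign(sum_t w_t * h_t(x))
    votes = sum(w * h(X) for w, h in zip(weights, hypotheses))
    return np.sign(votes)
```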

To my understanding, $w_t$'s role is to determine how good $h_t$ is.

If the weighted error $\epsilon_t$ is low, $w_t$ will be high and $h_s$ will give more weight to that particular $h_t$.
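For example (numbers of my own choosing): $\epsilon_t = 0.1$ gives $w_t = \frac{1}{2}\ln\frac{0.9}{0.1} \approx 1.10$, while $\epsilon_t = 0.4$ gives $w_t = \frac{1}{2}\ln\frac{0.6}{0.4} \approx 0.20$; and if $\epsilon_t = 0.5$, then $w_t = 0$ and $h_t$ contributes nothing to the vote.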

I have two problems with this method. The first regards the use of $D^t$ when calculating $w_t$. It is useful for calculating $D^{t+1}$ later, but that is not what I am talking about: when I estimate the loss of a hypothesis, I want $\epsilon_t$ to estimate its true loss.

In my opinion, the loss of the hypothesis should be calculated with a uniform distribution, which represents the real world as closely as possible.

The second problem, which is related to the first one, is: why do we calculate $\epsilon_t$ on the training set?

How come we don't use a test set to estimate the loss as well as we can? Of course, a loss evaluated on the training set does not give us as much information as it would on a new set of data.

I know we need those $\epsilon_t$ and $w_t$ for calculating $\hat D^{t+1}_i$, but can't we calculate another set of weights to get a better final hypothesis $h_s$?

Edit:
I want to clarify the method I suggest:

We take the data and separate it into 3 parts: train, test1, and test2. To make it easier to understand, the training set has size $m$ and test1 has size $n$ (in the $\epsilon'_t$ line below, $(x_i, y_i)$ range over test1).

$\text{for } t=1,2,\ldots,T:$

$h_t = \arg\min_{h\in H} \sum_{i=1}^m D_i^t\cdot 1_{[h(x_i)\neq y_i]}$

$\epsilon_t = \sum_{i=1}^m D_i^t\cdot 1_{[h_t(x_i)\neq y_i]}$

$w_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$

$\epsilon'_t = \sum_{i=1}^n \frac{1}{n}\cdot 1_{[h_t(x_i)\neq y_i]}$

$w'_t = \frac{1}{2}\ln\left(\frac{1-\epsilon'_t}{\epsilon'_t}\right)$

$\forall i \in [m]: \hat D^{t+1}_i = D_i^t\cdot \exp(-y_i\cdot w_t\cdot h_t(x_i))$

$D_i^{t+1}=\frac{\hat D_i^{t+1}}{\sum_{j=1}^m \hat D_j^{t+1}}$

afterwards our prediction will be:

$h_s(x)=\operatorname{sign}\left(\sum_{t=1}^T w'_t\cdot h_t(x)\right)$
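If I understand my own proposal correctly in code, only the vote weights change; here is a sketch of that variant in the same style as the snippet above (again, all names and the exact setup of `(X1, y1)` as the test1 set are my assumptions).

```python
# A sketch of the proposed variant: the distribution update still uses w_t
# (computed from the weighted training error eps_t), but the final vote uses
# w'_t computed from the uniform error on the held-out test1 set (X1, y1).
import numpy as np

def adaboost_heldout_vote_weights(X, y, X1, y1, H, T):
    m = len(y)
    D = np.full(m, 1.0 / m)
    hypotheses, vote_weights = [], []
    for t in range(T):
        errors = np.array([np.sum(D * (h(X) != y)) for h in H])
        best = int(np.argmin(errors))
        h_t, eps_t = H[best], errors[best]
        w_t = 0.5 * np.log((1.0 - eps_t) / eps_t)     # drives the D^{t+1} update only
        eps1_t = np.mean(h_t(X1) != y1)               # eps'_t: uniform error on test1
        w1_t = 0.5 * np.log((1.0 - eps1_t) / eps1_t)  # w'_t: used in the final vote
        hypotheses.append(h_t)
        vote_weights.append(w1_t)
        D = D * np.exp(-y * w_t * h_t(X))
        D /= D.sum()
    return hypotheses, np.array(vote_weights)
```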

Thanks a lot

Best Answer

Great job identifying that AdaBoost tends to overfit; in fact, the use of a validation set has been proposed in the past, for example here.

I inspected your algorithm and it seems to have the potential to generalize better. However, there is one thing to be careful about: might you be overfitting to the validation set instead, since the weight on each classifier is determined entirely by the validation set, with no sampling involved?

Setting the overfitting concern aside for now, here is an idea for you. Currently you chose some $w'_i$ and decided to use them, but are you open to the idea of using alternative weights instead? What you are trying to do is essentially construct a linear classifier over $h(x_i)=(h_1(x_i), h_2(x_i), \ldots, h_T(x_i))$ to predict $y_i$, where $x_i$ and $y_i$ come from the validation set.

You could instead build another classifier, such as an SVM with regularization or a logistic regression, using $h$ as the features to predict $y_i$. You could even build a sparser classifier by dropping insignificant predictors. You might want to prepare another validation set or do $k$-fold cross-validation to choose the parameters for this stage.
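To illustrate the suggestion, here is a sketch using scikit-learn's logistic regression as the second-stage model; the helper names and the `(X_val, y_val)` validation set are my assumptions, and an SVM or a sparsity-inducing penalty could be swapped in the same way.

```python
# Sketch of the stacking idea: use the base hypotheses' outputs as features and
# fit a regularized linear model on the validation set. Names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_stacked_weights(hypotheses, X_val, y_val, C=1.0):
    # Feature matrix: column t holds h_t(x_i) for each validation point x_i
    F = np.column_stack([h(X_val) for h in hypotheses])
    # L2-regularized logistic regression replaces the fixed w'_t vote weights;
    # C could be chosen on another validation set or by k-fold cross-validation.
    return LogisticRegression(C=C).fit(F, y_val)

def stacked_predict(clf, hypotheses, X):
    F = np.column_stack([h(X) for h in hypotheses])
    return clf.predict(F)
```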
