Gradient Boosting – How to Keep Classification Prediction in [0,1]

boosting, classification, logistic

The question

I am struggling to understand how the prediction is kept within the $[0,1]$ interval when doing binary classification with Gradient Boosting.

Assume we are working on a binary classification problem, and our objective function is the log loss, $-\sum_i \left[ y_i \log(H_m(x_i)) + (1-y_i) \log(1-H_m(x_i)) \right]$, where $y_i \in \{0,1\}$ is the target variable and $H_m$ is our current model.

When training the next weak learner $h_i$ such that our new model is $H_i = H_{i-1} + h_i$, what is the mechanism that is supposed to keep $H_i \in [0,1]$? Or, maybe a more relevant question, is there such a mechanism?


More information on what I am doing

I am trying to implement gradient boosting using regression trees. To keep the predictions in $[0,1]$, I multiply $h_i$ by a factor $c \in [0, c_{\text{max}}]$, where $c_{\text{max}}$ is the largest value such that $H + c_{\text{max}} h$ does not go below zero or above one, and I select the $c$ in this range that minimizes the loss function; a sketch follows below.
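Here is a minimal sketch of what I mean (NumPy; the grid search and the helper name `constrained_step` are illustrative, not my actual implementation):

```python
import numpy as np

def constrained_step(H, h, y, n_grid=100):
    # H, h: per-point outputs of the current model and the new weak learner.
    # Returns the c in [0, c_max] that keeps H + c*h in [0, 1] and
    # minimizes the log loss against the training labels y.
    with np.errstate(divide="ignore", invalid="ignore"):
        cap = np.where(h > 0, (1.0 - H) / h,
                       np.where(h < 0, -H / h, np.inf))
    c_max = max(0.0, float(np.min(cap)))  # largest step keeping H + c*h in [0, 1]

    eps = 1e-12  # guard against log(0) at the boundary
    cs = np.linspace(0.0, c_max, n_grid)
    losses = [-np.mean(y * np.log(np.clip(H + c * h, eps, 1.0))
                       + (1 - y) * np.log(np.clip(1.0 - (H + c * h), eps, 1.0)))
              for c in cs]
    return cs[int(np.argmin(losses))]
```

When a perfectly classified point caps $c_{\text{max}}$ at zero, this search returns $c = 0$, which is exactly the stuck behavior described next.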

This brings the following problem: after some rounds, one point is perfectly classified, and the best split available for pushing the classifier in the direction of the gradient would push this point above one, which I prevent by setting $c = 0$. Every subsequent iteration then selects the same split and the same $c = 0$.

I tried common regularization practices:

  • Decreasing the learning rate by multiplying $c$ by $\mu = 0.01$. This just delays the problem.
  • Subsampling the feature space. But some points are very easy to classify: they tick almost every box on the "is this a positive?" form, and almost every good split exhibits this behavior.

I think this is not a problem of parameters; there should be a more principled way to fix it. I am not discarding the possibility that my implementation is broken, but I have found nothing addressing this problem.

In the context of the logistic loss, the quantity we are manipulating should be a probability, so how do we keep it from leaving $[0,1]$?


My intuition would be to pass the model we are building, $H$, through a sigmoid function so that it is bounded to $[0,1]$, and I guess that would work, but I want to know whether there are other solutions. Since gradient boosting is used successfully in classification tasks, a "correct" (i.e., with justification) solution should exist.

Best Answer

I like to think of this in analogy with the case of linear models, and their extension to GLMs (generalized linear models).

In a linear model, we fit a linear function to predict our response

$$ \hat y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n $$

To generalize to other situations, we introduce a link function, which transforms the linear part of the model onto the scale of the response. (Technically this is an inverse link, but I find it easier to think of it this way: transforming the linear predictor into a response, rather than transforming the response into a linear predictor.)

For example, the logistic model uses the sigmoid (inverse logit) function

$$ \hat y = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n))} $$

and Poisson regression uses the exponential function

$$ \hat y = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n) $$
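In code, these inverse links are just elementwise transforms; a trivial illustration (the names are mine, not from any library):

```python
import numpy as np

# Inverse link functions: map the linear predictor eta onto the response scale.
def inv_identity(eta):   # Gaussian (ordinary linear regression)
    return eta

def inv_logit(eta):      # binomial (logistic regression)
    return 1.0 / (1.0 + np.exp(-eta))

def inv_log(eta):        # Poisson regression
    return np.exp(eta)
```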

To construct an analogy with gradient boosting, we replace the linear part of these models with the sum of the boosted trees. For example, the Gaussian case (analogous to linear regression) becomes the well-known

$$ \hat y = \sum_i h_i $$

where $h_i$ is our sequence of weak learners. The binomial case is analogous to logistic regression (as you intuited in your question)

$$ \hat y = \frac{1}{1 + \exp\left(-\sum_i h_i\right)} $$

and Poisson boosting is analogous to Poisson regression

$$ \hat y = \exp\left(\sum_i h_i\right) $$

The question remains: how does one fit these boosted models when the link function is involved? For the Gaussian case, where the link is the identity, the often-heard mantra of fitting weak learners to the residuals of the current working model works out, but this does not really generalize to the more complicated models. The trick is to write the loss function being minimized as a function of the linear part of the model (i.e., the $\sum_i \beta_i x_i$ part of the GLM formulation).

For example, the binomial loss is usually encountered as

$$ -\sum_i \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right] $$

Here, the loss is a function of $p_i$, the predicted values on the same scale as the response, and $p_i$ is a non-linear transformation of the linear predictor $L_i$. Instead, we can re-express the same loss as a function of $L_i$ (in this case also known as the log odds):

$$ -\sum_i \left[ y_i L_i - \log(1 + \exp(L_i)) \right] $$

Then we can take the gradient of this with respect to $L$, and boost to directly minimize this quantity.
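To spell that gradient out: differentiating the loss with respect to a single $L_i$ gives

$$ \frac{\partial}{\partial L_i} \left[ -\big( y_i L_i - \log(1 + \exp(L_i)) \big) \right] = \frac{\exp(L_i)}{1 + \exp(L_i)} - y_i = p_i - y_i $$

so the negative gradient is the familiar residual $y_i - p_i$, and each new weak learner is a regression tree fit to these residuals on the unbounded log-odds scale, where no clipping is ever needed.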

Only at the very end, when we want to produce predictions for the user, do we apply the (inverse) link function to the final sum of weak learners to put the predictions on the same scale as the response. While fitting the model, we work internally on the linear scale the entire time.
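To make this concrete, here is a minimal sketch of the whole recipe (scikit-learn's `DecisionTreeRegressor` stands in for the weak learner; the function names and defaults are illustrative, not from any particular library):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_logistic_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    # Initialize the log odds at the base rate of the labels.
    p0 = np.clip(y.mean(), 1e-12, 1 - 1e-12)
    base = np.log(p0 / (1 - p0))
    L = np.full(len(y), base)                  # linear predictor, unbounded
    trees = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-L))           # probabilities, used only for the gradient
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, y - p)                     # fit to the negative gradient (residuals)
        L += learning_rate * tree.predict(X)   # update on the log-odds scale
        trees.append(tree)
    return base, trees

def predict_proba(X, base, trees, learning_rate=0.1):
    # Sum the weak learners on the linear scale; apply the sigmoid
    # (inverse link) exactly once, at the very end.
    L = base + learning_rate * sum(tree.predict(X) for tree in trees)
    return 1.0 / (1.0 + np.exp(-L))
```

Nothing in the training loop ever clips or constrains anything: the boosted sum lives on the log-odds scale, and the $[0,1]$ guarantee comes for free from the sigmoid at prediction time.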
