Solved – How does GBM work with a Poisson loss function

boosting, poisson-regression

I’m familiar with Poisson regression and with the general idea of GBM (each tree is fit to the residuals of the previous trees), but I don’t understand how GBM works in the case of a Poisson loss function.

Questions:

  • How is the residual transformed before being fit by a tree? The log of the residual?
  • Do we need to apply a transform at the end to get the prediction?
    Example: prediction = exp(average of Y + prediction of the first tree + prediction of the second tree + …)

Best Answer

I am going to define the GBM algorithm first to clarify the question:

Algorithm 10.3: Gradient Tree Boosting Algorithm from Hastie's The Elements of Statistical Learning states the following:

1. Initialize $f_0(x) = \arg\min_\gamma \sum_{i=1}^N L(y_i, \gamma)$.
2. For $m = 1$ to $M$:
   (a) For $i = 1, \dots, N$ compute the pseudo-residuals $r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$.
   (b) Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, 2, \dots, J_m$.
   (c) For $j = 1, 2, \dots, J_m$ compute $\gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)$.
   (d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.
3. Output $\hat{f}(x) = f_M(x)$.

Here $N$ is the number of samples, $M$ is the number of iterations, and $J_m$ is the number of terminal regions (the size of the $m$-th tree). Line 3 produces $\hat{f}(x)$; in the $K$-class classification version of the algorithm this is a vector of length $K$, where $K$ is the number of classes.
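
To make the loop concrete, here is a minimal sketch of the algorithm above specialized to the Poisson loss, boosting on the log scale $f = \log(\lambda)$. This is illustrative code, not a library implementation: the function names (`fit_poisson_gbm`, `predict_poisson_gbm`), the learning rate, and the tree size are all assumptions of the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_poisson_gbm(X, y, n_trees=100, max_leaf_nodes=8, lr=0.1):
    """Gradient tree boosting with Poisson loss, boosting f = log(lambda)."""
    f0 = np.log(y.mean())                      # optimal constant: argmin_g sum(e^g - y*g)
    f = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        resid = y - np.exp(f)                  # (a) pseudo-residual: y_i - lambda_i
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, resid)                     # (b) fit a tree to the raw residuals
        leaf = tree.apply(X)                   # terminal region of each sample
        # (c) line search per leaf; the tiny epsilon guards all-zero leaves
        gamma = {j: np.log((y[leaf == j].sum() + 1e-12) /
                           (np.exp(f[leaf == j]).sum() + 1e-12))
                 for j in np.unique(leaf)}
        f += lr * np.array([gamma[j] for j in leaf])   # (d) update on the log scale
        trees.append((tree, gamma))
    return f0, trees

def predict_poisson_gbm(f0, trees, X, lr=0.1):
    """lr must match the value used in training."""
    f = np.full(X.shape[0], f0)
    for tree, gamma in trees:
        leaf = tree.apply(X)
        f += lr * np.array([gamma[j] for j in leaf])
    return np.exp(f)                           # the final transform back to counts
```

Step (c) has a closed form here because the line search $\arg\min_\gamma \sum_{i \in R_{jm}} e^{f(x_i) + \gamma} - y_i (f(x_i) + \gamma)$ is solved exactly by $\gamma = \log\left(\sum_i y_i / \sum_i e^{f(x_i)}\right)$.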

How is the residual transformed before being fit by a tree? Log of the residual?

The residual is defined by step (a) above and depends directly on the loss function. As in other problems, the loss function follows from the distribution chosen to model the conditional probability of $y \mid x$: it is the negative log-likelihood of that distribution. For a Poisson distribution the log-likelihood is:

$$\ell(y;\lambda) = \sum_i \left[ y_i\log(\lambda_i) - \lambda_i - \log(y_i!) \right]$$

Equivalently, the loss (the negative log-likelihood) is:

$$L(y;\lambda) = \sum_i \left[ \lambda_i + \log(y_i!) - y_i\log(\lambda_i) \right]$$

Since we want to minimize the loss and will eventually take a derivative, we can drop the constant term $\log(y_i!)$, and the above simplifies to:

$$L(y;\lambda) = \sum_i \left[ \lambda_i - y_i\log(\lambda_i) \right]$$

In the algorithm outlined above, the residual in step (a) is the negative partial derivative of the loss w.r.t. $f(x_i)$, evaluated at $f_{m-1}$. Boosting works through the log link, $\lambda_i = \exp(f(x_i))$, which keeps the predicted rate positive. Substituting into the loss gives

$$L(y;f) = \sum_i \left[ e^{f(x_i)} - y_i f(x_i) \right],$$

so the pseudo-residual is

$$r_{im} = -\frac{\partial L}{\partial f(x_i)}\bigg|_{f = f_{m-1}} = y_i - e^{f_{m-1}(x_i)} = y_i - \hat{\lambda}_i.$$

So, to answer the first question: no, the residual itself is not transformed. The trees are fit to the raw residuals $y_i - \hat{\lambda}_i$ on the count scale; it is the model $f$, not the residual, that lives on the log scale.
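
If you want to convince yourself of this, a quick finite-difference check (an illustrative snippet; the random data and setup are made up) confirms that the negative gradient of the simplified loss w.r.t. $f$ is exactly $y - e^f$:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.poisson(3.0, size=5).astype(float)
f = rng.normal(size=5)

loss = lambda f: np.sum(np.exp(f) - y * f)    # simplified Poisson loss in f
eps = 1e-6
num_grad = np.array([(loss(f + eps * np.eye(5)[i]) - loss(f - eps * np.eye(5)[i])) / (2 * eps)
                     for i in range(5)])      # central differences, one coordinate at a time
print(np.allclose(-num_grad, y - np.exp(f), atol=1e-5))  # True: residual = y - lambda
```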

Do we need to apply a transform at the end to get the prediction? Example: prediction = exp(average of Y + prediction of the first tree + prediction of the second tree + …)

For the Poisson case: yes, essentially as you guessed. Since boosting operates on the log scale, the final prediction is obtained by exponentiating the sum of the initial value and all the tree contributions:

$$\hat{\lambda}(x) = \exp\!\left(\hat{f}(x)\right) = \exp\!\left(f_0(x) + \sum_{m=1}^M \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})\right)$$

(with one correction to your formula: $f_0$ is the log of the average of $Y$, not the average itself). In the $K$-class classification analogue, the corresponding transformation to produce probabilities is the softmax:

$$p_k(x) = \frac{\exp(f_k(x))}{\sum_{l=1}^K \exp(f_l(x))}$$

where again $k$ is the class of interest. To make a prediction we simply take the $k$ that produces the maximum $p_k(x)$.
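
In practice, libraries apply the final exp internally, so `predict` already returns rates on the count scale. A quick check with scikit-learn, which does offer `loss="poisson"` in `HistGradientBoostingRegressor` (the synthetic data here is just for illustration):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.poisson(np.exp(0.3 * X[:, 0]))        # Poisson counts with a log-linear rate

model = HistGradientBoostingRegressor(loss="poisson").fit(X, y)
pred = model.predict(X)                        # already on the rate (count) scale
print(pred.min() > 0)                          # True: exp(...) guarantees positivity
```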