Solved – Multiclass gradient boosting: how to derive the initial guess, how to predict a probability

Tags: boosting, gradient, r

I have some questions regarding multi-class boosted-tree algorithms. Currently, I am applying XGBoost as implemented in R to solve a multi-class classification problem.

According to StatQuest, for a simple two-class case, the initial guess is:

p = exp(log odds) / (1 + exp(log odds))

(https://www.youtube.com/watch?v=jxuNLH5dXCs)

I could not find an answer on how the initial guess is derived in the multi-class case.

Furthermore, I suspect that the predict() function in R for an XGBoost model uses some sort of softmax function to produce the probability values for individual predictions.

I tried to understand the code but I did not really comprehend it.

Can you give a clear example of how to calculate such a probability using boosted trees? Does it relate to some sort of softmax output, or does it somehow relate to the sum of weights of those trees that agreed on the majority class?

I read different opinions about the last question and would love to have a final answer.

Thank you!

Best Answer

As you correctly recognise, during the first step we cannot assign $f_{m-1}(x_i)$ to anything, as we have yet to estimate $f$. We usually set it to the mean of the $y_i$ across all the samples, or some other "version of the central tendency". Indeed, for binary classification we use the log-odds; effectively np.log(proba_positive_class / (1 - proba_positive_class)).
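To make that binary case concrete, here is a minimal R sketch (the 0/1 response vector is made up purely for illustration) that computes the initial log-odds from the relative frequency of the positive class and then maps it back to a probability with the sigmoid from the question:

y <- c(1, 0, 0, 1, 1, 0, 1, 1)        # toy 0/1 response, made up for illustration
p <- mean(y)                          # relative frequency of the positive class
init_log_odds <- log(p / (1 - p))     # initial guess on the raw-score scale
1 / (1 + exp(-init_log_odds))         # the sigmoid recovers p again
# p = 0.625, init_log_odds ~ 0.511, and the sigmoid gives back 0.625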

When we work with multi-class classification (assuming $M$ separate classes, $M > 2$), our raw predictions are of dimension $N \times M$, with $N$ being the number of samples. In that sense, we can calculate the log-odds for each class label in a one-vs-all manner quite naturally, using the relative frequencies of each class in our response vector.
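As a sketch of that one-vs-all construction (reusing the iris labels from the example further down), the per-class initial log-odds are just log(p_k / (1 - p_k)) computed from the relative class frequencies:

lb <- as.numeric(iris$Species) - 1          # class labels 0, 1, 2
p_k <- as.numeric(table(lb)) / length(lb)   # relative frequency of each class
log(p_k / (1 - p_k))                        # one-vs-all log-odds per class
# iris is balanced, so every class starts at log((1/3)/(2/3)) ~ -0.693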

Notice that in reality, given we do not assume some outlandish baseline, after the first few dozen iterations the difference will be nominal. For example, XGBoost sets its "initial guess" of the log-odds to 0.50 and ignores the relative label frequencies. In a somewhat more educated vein, sklearn's gradient booster sets the "initial guess" of the log-odds to np.log(proba_kth_class), so not exactly the log-odds either; LightGBM follows that logic too (i.e. it "boosts from the average").
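Just to illustrate the distinction described above (this mirrors the description in this answer, not any library's actual source code), here is what those three starting points would look like for the iris labels:

lb <- as.numeric(iris$Species) - 1          # class labels 0, 1, 2
p_k <- as.numeric(table(lb)) / length(lb)   # class priors from the label frequencies
log(p_k)                                    # sklearn-style start: log of the class prior (~ -1.099 each)
log(p_k / (1 - p_k))                        # one-vs-all log-odds, for comparison (~ -0.693 each)
rep(0.5, length(p_k))                       # flat 0.50 start, as described for XGBoost above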

Finally, yes: whatever the raw estimate is, we then apply the softmax to it. Just be aware that for the multi-class case we use exp(raw_preds - log(sum(exp(raw_preds)))), based on LogSumExp; this is effectively the same as $\frac{e^{z_i}}{\sum_{j=1}^M e^{z_j}}$, where the $z_i$ are our raw scores.

Ah, and a quick example of how the softmax works:

library(xgboost)
data(iris)

# Recode the response as integer class labels 0, 1, 2
lb <- as.numeric(iris$Species) - 1
num_class <- 3

# Fit a multi-class booster that outputs class probabilities
set.seed(11)
N <- 120
bst <- xgboost(data = as.matrix(iris[1:N, -5]), label = lb[1:N],
               max_depth = 4, eta = 0.5, nthread = 2, nrounds = 10,
               subsample = 0.15, objective = "multi:softprob",
               num_class = num_class, verbose = FALSE)


predict(bst, as.matrix(iris[N, -5]), outputmargin = TRUE) # Raw scores
# -1.247365  1.584843  1.164099
predict(bst, as.matrix(iris[N, -5]), outputmargin = FALSE) # Probabilities
# 0.03432514 0.58294052 0.38273433

manual_sm <- function(rs)  exp(rs - log(sum(exp(rs)))) # Manual LogSumExp
manual_sm(c(-1.247365,  1.584843,  1.164099))
# 0.03432511 0.58294053 0.38273436
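As a sanity check (nothing here beyond basic arithmetic), the LogSumExp form is algebraically the same as the textbook softmax, and the resulting values sum to one:

plain_sm <- function(rs) exp(rs) / sum(exp(rs))   # textbook softmax, no LogSumExp rewrite
plain_sm(c(-1.247365, 1.584843, 1.164099))        # same values as manual_sm above, up to rounding
sum(manual_sm(c(-1.247365, 1.584843, 1.164099)))  # probabilities sum to 1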