I have been trying to understand gradient boosting by reading various blogs and websites, and by looking through, for example, the XGBoost source code. However, I cannot find an understandable explanation of how gradient boosting algorithms produce probability estimates. So, how do they calculate the probabilities?
Gradient Boosting – Calculating Probability Estimates
boosting, classification, ensemble learning, machine learning
Related Solutions
You have not added any random noise to your data. The learning rate is a regularization strategy that protects your model from overfitting. If your data is noiseless, and you include all the variables that are actually related to the response, then you will not overfit.
Try something like this:
import numpy as np

z_signal = np.sin(XY[:, 0] * 10 + XY[:, 1] * 3) + np.cos(XY[:, 0] * 2) + 3 * np.sin(XY[:, 1] * 5)
z = z_signal + np.random.normal(size=XY.shape[0])  # add i.i.d. Gaussian noise to the signal
Solved – Multiclass gradient boosting: how to derive the initial guess, how to predict a probability
As you correctly recognise, during the first step we cannot assign $f_{m-1}(x_i)$ to anything, as we have yet to estimate $f$. We usually set it to the mean of the $y_i$ across all the samples, or to some other measure of central tendency. Indeed, for binary classification we use the log-odds; effectively np.log(proba_positive_class / (1 - proba_positive_class)).
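For concreteness, here is a minimal sketch of that binary baseline in plain numpy (the labels are hypothetical):

import numpy as np

y = np.array([0, 0, 0, 1, 1])      # hypothetical binary labels
p = y.mean()                       # relative frequency of the positive class
f0 = np.log(p / (1 - p))           # initial raw prediction: the log-odds of the base rate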
When we work with multi-class classification (assuming $M$ separate classes, $M > 2$), our raw predictions are of dimension $N \times M$, $N$ being the number of samples. In that sense, we can calculate the log-odds for each class label in a one-vs-all manner quite naturally, using the relative frequencies of each class in our response vector.
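A sketch of those per-class baselines, assuming y holds integer-encoded labels:

import numpy as np

y = np.array([0, 0, 1, 2, 2, 2])            # hypothetical labels, M = 3 classes
priors = np.bincount(y) / len(y)            # relative frequency of each class
log_odds = np.log(priors / (1 - priors))    # one-vs-all log-odds per class
f0 = np.tile(log_odds, (len(y), 1))         # N x M matrix of initial raw scores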
Notice that in reality, provided we do not assume some outlandish baseline, the difference becomes negligible after the first few dozen iterations. For example, XGBoost sets its "initial guess" of the log-odds to 0.50 and ignores the relative label frequencies. In a somewhat more educated vein, sklearn's gradient booster sets the "initial guess" of the log-odds to np.log(proba_kth_class), so not exactly the log-odds either; LightGBM follows that logic too (i.e. it boosts from the average).
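To see how small the difference between these conventions is, continuing from the priors sketch above (an illustration only, not any library's internals):

import numpy as np

priors = np.array([1/3, 1/6, 1/2])             # hypothetical class frequencies
sklearn_style = np.log(priors)                 # log of the prior, as described above
log_odds_style = np.log(priors / (1 - priors)) # the "proper" one-vs-all log-odds
# Both are constant per-class offsets of the raw scores; after the softmax and a
# few boosting rounds, the resulting probabilities are nearly indistinguishable.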
Finally, yes, whatever the raw estimate is, we then apply the softmax to it. Just be aware that for the multi-class case we use exp(raw_preds - log(sum(exp(raw_preds))))
based on LogSumExp; this is effectively the same as $\frac{e^{z_i}}{\sum_{j=1}^M e^{z_j}}$, where $z$ is our vector of raw scores.
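In Python, that LogSumExp-based softmax looks as follows; the max-subtraction is the usual guard against overflow (a sketch, not any library's internal code):

import numpy as np

def softmax(raw_preds):
    z = raw_preds - raw_preds.max()             # stabilise before exponentiating
    return np.exp(z - np.log(np.exp(z).sum()))  # exp(z_i - LogSumExp(z))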
Ah, and a quick example of how the softmax works:
library(xgboost)
data(iris)
lb <- as.numeric(iris$Species) - 1
num_class <- 3
set.seed(11)
N = 120
bst <- xgboost(data = as.matrix(iris[1:N, -5]), label = lb[1:N],
max_depth = 4, eta = 0.5, nthread = 2, nrounds = 10,
subsample = 0.15, objective = "multi:softprob",
num_class = num_class, verbose = FALSE)
predict(bst, as.matrix(iris[N, -5]), outputmargin = TRUE) # Raw scores
# -1.247365 1.584843 1.164099
predict(bst, as.matrix(iris[N, -5]), outputmargin = FALSE) # Probabilities
# 0.03432514 0.58294052 0.38273433
manual_sm <- function(rs) exp(rs - log(sum(exp(rs)))) # Manual LogSumExp
manual_sm(c(-1.247365, 1.584843, 1.164099))
# 0.03432511 0.58294053 0.38273436
Best Answer
TL;DR: The log-odds for a sample is the sum of the weights of its terminal leaves. The probability of the sample belonging to class 1 is the inverse-logit transformation of that sum.
Analogously to logistic regression, the logistic function maps scores that are linear on the logit scale to probabilities:
$$ z = Xw \\ \mathbb{P}(y=1|X) = \frac{1}{1 + \exp(-z)} $$
Unlike logistic regression, the "features" in $X$ are constructed as the terminal nodes of an ensemble of decision trees grown by the boosting procedure. Each row of $X$ collects the terminal leaves for one sample; the row is a $T$-hot binary vector, for $T$ the number of trees. (Each XGBoost tree is generated according to a particular algorithm, but that's not relevant here.)
There are $n$ columns in $X$, one for each terminal node across all trees. There is no simple expression for the total number of terminal nodes, because the number of leaves can vary between trees (and usually does, in my experience).
Each leaf in each tree has an associated "weight." That weight is recorded in $w$; to be conformable with $X$, there are $n$ elements in $w$. The weights themselves are derived from the gradient boosting procedure; see: In XGboost are weights estimated for each sample and then averaged
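As a quick sanity check of the TL;DR, one can compare the probabilities XGBoost reports against the inverse-logit of its raw margins, and inspect the $T$-hot leaf encoding via pred_leaf. A sketch using the Python package on a synthetic dataset:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 3}
bst = xgb.train(params, dtrain, num_boost_round=10)

margin = bst.predict(dtrain, output_margin=True)  # raw log-odds: sum of leaf weights (plus the global base score)
proba = bst.predict(dtrain)                       # reported probabilities
assert np.allclose(proba, 1 / (1 + np.exp(-margin)), atol=1e-6)

leaves = bst.predict(dtrain, pred_leaf=True)      # (n_samples, T) leaf index per tree
# one-hot encoding each column of `leaves` yields the T-hot rows of X described above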