GBM Tree Prediction – Interpretation of Single Tree Prediction in Pretty.gbm.tree

Tags: boosting, interpretation, r

I can't figure out how to interpret the Prediction in the results of pretty.gbm.tree called on a single tree from a gbm trained on a binary outcome with the bernoulli loss function. I'm using gbm v2.1.1.

Example

data('ptitanic', package='rpart.plot') # note this is not the default data(Titanic)
ptitanic$died <- 2-as.integer(ptitanic$survived) #survived is fctr w/ 2 levels died/survived
mean(ptitanic$died) # 0.618 death rate
form <- as.formula('died ~ sex + age + pclass + sibsp + parch')

library('gbm')
set.seed(1)
m <- gbm(form,
         distribution = 'bernoulli',
         data = ptitanic,
         interaction.depth=4,
         n.trees=50)
summary(m)

mean(predict(m, ptitanic, type='response',n.trees=50)) # 0.618 death rate

# let's look at the 1st tree
t <- pretty.gbm.tree(m, i=1)
# I want to see the split variable names instead of indices 
# The indices are -1 for terminal, 0 for first term, 1 for second term, etc.
t$SplitVar <- c('Terminal',attr(terms(form),'term.labels'))[t$SplitVar+2]
# The predictions at nodes look like:
head(t$Prediction)
# [1] -2.066845e-05 -1.472631e-03 -2.374948e-03 -4.808952e-04 -1.472631e-03  7.829118e-04

What I have tried:

According to ?predict.gbm, the function will return a vector of log-odds

Returns a vector of predictions. By default the predictions are on the
scale of f(x). For example, for the Bernoulli loss the returned value
is on the log odds scale, …

So it seems like Prediction in the tree ought to be log-odds, and I should be able to get the probability estimate with:

$x = \ln{\frac{p}{1-p}}$

$p = \frac{1}{\frac{1}{e^{x}}+1}$

i.e.:

t$OR <- exp(t$Prediction)
t$Prob <- 1/(1/t$OR + 1)
head(t$Prob)  
#[1] 0.5000094 0.4996384 0.4994387 0.4998654 0.4996384 0.5002175

What is very strange is that the odds ratio at the root node is ~1, or p = 0.5 — despite the overall death rate of 61.8% mentioned above. So maybe it is trying to tell me relative risk?

There is a somewhat cryptic detail in ?predict.gbm:

The predictions from gbm do not include the offset term. The user may
add the value of the offset to the predicted value if desired.

Is adding an offset something I need to do when looking at the very first tree? (I could maybe understand the need to do additive effects in subsequent trees, but the first one?) If so, how do I do that?

Best Answer

There's a subtlety about how the gbm algorithm works that you are missing, and it's leading to your confusion.

The predict method returns predictions from the entire boosted model, and these are indeed log-odds when fitting to minimize the Bernoulli deviance. On the other hand, the predictions from the individual trees are not log-odds; they are something quite different.

Indeed, while the model as a whole is fit to predict the response, the individual trees are not; each tree is fit to the negative gradient of the loss function (the pseudo-residuals), evaluated at the current predictions and the response. This is the "gradient" part of "gradient boosting".

Here's a minimal example that will hopefully clarify what is going on. I'll use a booster minimizing the Gaussian loss function to keep the math simple and focus on the important concepts.

x <- seq(0, 1, length.out = 6)
y <- c(0, 0, 0, 1, 1, 1)
df <- data.frame(x = x, y = y)

M <- gbm(y ~ x, data = df, 
         distribution="gaussian",
         n.trees = 1,
         bag.fraction = 1.0, n.minobsinnode = 1, shrinkage = 1.0)

t <- pretty.gbm.tree(M, i = 1)
t[, c("SplitVar", "LeftNode", "RightNode", "MissingNode", "Prediction")]

Which looks like

  SplitVar LeftNode RightNode MissingNode Prediction
0        0        1         2           3        0.0
1       -1       -1        -1          -1       -0.5
2       -1       -1        -1          -1        0.5
3       -1       -1        -1          -1        0.0

Let me break this down. In this model, we are minimizing the following loss function:

$$ L(y, \hat y) = \frac{1}{2} (y - \hat y)^2 $$

The negative gradient with respect to the prediction (the pseudo-residual) is:

$$ -\nabla L (y, \hat y) = y - \hat y $$

The tree is fit to predict the value of this quantity.

In detail, the model starts out at the zeroth stage predicting a constant, the mean response. In our example data, the mean response is $0.5$. Then the pseudo-residual is evaluated at the data and the current predictions:

# pseudo-residual: the negative gradient of the squared-error loss
grad <- function(y, current_preds) {
  y - current_preds
}

grad(y, 0.5)

which results in

[1] -0.5 -0.5 -0.5  0.5  0.5  0.5

Now you can see what has happened by comparing the tree's node predictions to this. The tree has recovered exactly these pseudo-residuals.
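
A quick check, using only objects defined above: the fitted model stores the stage-zero constant in initF, and because shrinkage = 1 here, the one-tree prediction is just that constant plus the tree's leaf value, so it should recover y exactly in this toy example.

M$initF                          # 0.5, the stage-zero constant (mean of y)
predict(M, newdata = df, n.trees = 1)
# should be 0 0 0 1 1 1, i.e. initF plus the leaf value of tree 1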

The same thing is true in the case of a Bernoulli model, though the details are more complex. Here gbm works with the Bernoulli log-likelihood

$$ \ell(y, f) = y f - \log(1 + e^f) $$

Note here that $f$ is the predicted log-odds, not the probability. The gradient of the log-likelihood with respect to $f$ (again, the pseudo-residual) is

$$ \nabla \ell(y, f) = y - \frac{1}{1 + e^{-f}} $$

and it is this that the predictions from the trees are approximating.
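
This also resolves the puzzle about the root node in the original example. The stage-zero prediction is not 0 but the constant stored in m$initF, the log-odds of the marginal death rate; the first tree only supplies small corrections to that baseline (each boosting step is scaled by the shrinkage, which defaults to 0.001), which is consistent with the tiny Prediction values shown above and with exponentiated values near 0.5. As a rough check on the ptitanic fit from the question:

m$initF                       # ~0.48, i.e. log(0.618 / 0.382), the baseline log-odds
plogis(m$initF)               # ~0.618, the marginal death rate
# the gradient the first tree is fit to: y - p, with p = plogis(m$initF) for everyone
head(ptitanic$died - plogis(m$initF))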

In short, the predictions from the individual trees in a gradient booster are not interpretable by comparison to the response; you must take great care when interpreting the internal structure of a boosted model.
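
If you want to see the additive structure without digging into the stored trees at all, a safer route is to let predict do the bookkeeping. Predictions on the link (log-odds) scale with n.trees = k are initF plus the accumulated contributions of the first k trees, so differencing consecutive predictions isolates one tree's contribution. A sketch with the model from the question:

f1 <- predict(m, ptitanic, n.trees = 1)   # initF + contribution of tree 1 (log-odds scale)
f2 <- predict(m, ptitanic, n.trees = 2)   # initF + contributions of trees 1 and 2
head(f1 - m$initF)                        # per-observation contribution of tree 1 alone
head(f2 - f1)                             # per-observation contribution of tree 2 alone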
