GBM Tree Prediction – Interpretation of Single Tree Prediction in Pretty.gbm.tree

Tags: boosting, interpretation, r

I can't figure out how to interpret the Prediction in the results of pretty.gbm.tree called on a single tree from a gbm trained on a binary outcome with the bernoulli loss function. I'm using gbm v2.1.1.

Example

data('ptitanic', package='rpart.plot') # note this is not the default data(Titanic)
ptitanic$died <- 2-as.integer(ptitanic$survived) #survived is fctr w/ 2 levels died/survived
mean(ptitanic$died) # 0.618 death rate
form <- as.formula('died ~ sex + age + pclass + sibsp + parch')

library('gbm')
set.seed(1)
m <- gbm(form,
         distribution = 'bernoulli',
         data = ptitanic,
         interaction.depth=4,
         n.trees=50)
summary(m)

mean(predict(m, ptitanic, type='response',n.trees=50)) # 0.618 death rate

# let's look at the 1st tree
t <- pretty.gbm.tree(m, i=1)
# I want to see the split variable names instead of indices 
# The indices are -1 for terminal, 0 for first term, 1 for second term, etc.
t$SplitVar <- c('Terminal',attr(terms(form),'term.labels'))[t$SplitVar+2]
# The predictions at nodes look like:
head(t$Prediction)
# [1] -2.066845e-05 -1.472631e-03 -2.374948e-03 -4.808952e-04 -1.472631e-03  7.829118e-04

What I have tried:

According to ?predict.gbm, the function will return a vector of log-odds

Returns a vector of predictions. By default the predictions are on the
scale of f(x). For example, for the Bernoulli loss the returned value
is on the log odds scale, …

So it seems like Prediction in the tree ought to be log-odds, and I should be able to get the probability estimate with:

$x = \ln{\frac{p}{1-p}}$

$p = \frac{1}{\frac{1}{e^{x}}+1}$

i.e.:

t$OR <- exp(t$Prediction)
t$Prob <- 1/(1/t$OR + 1)
head(t$Prob)  
#[1] 0.5000094 0.4996384 0.4994387 0.4998654 0.4996384 0.5002175

What is very strange is that the odds ratio at the root node is ~1, or p = 0.5 — despite the overall death rate of 61.8% mentioned above. So maybe it is trying to tell me relative risk?

There is a somewhat cryptic detail in ?predict.gbm:

The predictions from gbm do not include the offset term. The user may
add the value of the offset to the predicted value if desired.

Is adding an offset something I need to do when looking at the very first tree? (I could maybe understand the need to do additive effects in subsequent trees, but the first one?) If so, how do I do that?

Best Answer

There's a subtlety about how the gbm algorithm works that you are missing, and it's leading to your confusion.

The predict method returns predictions from the entire boosted model, and these are indeed log-odds when fitting to minimize the Bernoulli deviance. On the other hand, the predictions from the individual trees are not log-odds; they are something quite different.

Indeed, while the model as a whole is fit to predict the response, the individual trees are not; each tree is fit to the negative gradient of the loss function (the pseudo-residuals), evaluated at the current predictions and the response. This is the "gradient" part of "gradient boosting".

Here's a minimal example that will hopefully clarify what is going on. I'll use a booster minimizing the Gaussian loss function to keep the math simple and focus on the important concepts.

x <- seq(0, 1, length.out = 6)
y <- c(0, 0, 0, 1, 1, 1)
df <- data.frame(x = x, y = y)

M <- gbm(y ~ x, data = df, 
         distribution="gaussian",
         n.trees = 1,
         bag.fraction = 1.0, n.minobsinnode = 1, shrinkage = 1.0)

t <- pretty.gbm.tree(M, i = 1)
t[, c("SplitVar", "LeftNode", "RightNode", "MissingNode", "Prediction")]

Which looks like

  SplitVar LeftNode RightNode MissingNode Prediction
0        0        1         2           3        0.0
1       -1       -1        -1          -1       -0.5
2       -1       -1        -1          -1        0.5
3       -1       -1        -1          -1        0.0

Let me break this down. In this model, we are minimizing the following loss function:

$$ L(y, \hat y) = \frac{1}{2} (y - \hat y)^2 $$

The negative gradient with respect to the prediction (the pseudo-residual) is:

$$ -\nabla L (y, \hat y) = y - \hat y $$

The tree is fit to predict the value of this quantity.

In detail, the model starts out at the zeroth stage predicting a constant, the mean response. In our example data, the mean response is $0.5$. Then the pseudo-residual is evaluated at the data and the current predictions:

# pseudo-residual: the negative gradient of the squared-error loss
grad <- function(y, current_preds) {
  y - current_preds
}

grad(y, 0.5)

which results in

[1] -0.5 -0.5 -0.5  0.5  0.5  0.5

Now you can see what has happened by comparing the tree's node predictions to this. The tree has recovered exactly these pseudo-residuals.
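
A quick check, using only objects defined above: the fitted model stores the stage-zero constant in initF, and because shrinkage = 1 here, the one-tree prediction is just that constant plus the tree's leaf value, so it should recover y exactly in this toy example.

M$initF                          # 0.5, the stage-zero constant (mean of y)
predict(M, newdata = df, n.trees = 1)
# should be 0 0 0 1 1 1, i.e. initF plus the leaf value of tree 1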

The same thing is true in the case of a Bernoulli model, though the details are more complex. Here gbm works with the Bernoulli log-likelihood

$$ \ell(y, f) = y f - \log(1 + e^f) $$

Note here that $f$ is the predicted log-odds, not the probability. The gradient of the log-likelihood with respect to $f$ (again, the pseudo-residual) is

$$ \nabla \ell(y, f) = y - \frac{1}{1 + e^{-f}} $$

and it is this that the predictions from the trees are approximating.
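
This also resolves the puzzle about the root node in the original example. The stage-zero prediction is not 0 but the constant stored in m$initF, the log-odds of the marginal death rate; the first tree only supplies small corrections to that baseline (each boosting step is scaled by the shrinkage, which defaults to 0.001), which is consistent with the tiny Prediction values shown above and with exponentiated values near 0.5. As a rough check on the ptitanic fit from the question:

m$initF                       # ~0.48, i.e. log(0.618 / 0.382), the baseline log-odds
plogis(m$initF)               # ~0.618, the marginal death rate
# the gradient the first tree is fit to: y - p, with p = plogis(m$initF) for everyone
head(ptitanic$died - plogis(m$initF))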

In short, the predictions from the individual trees in a gradient booster are not interpretable by comparison to the response; you must take great care when interpreting the internal structure of a boosted model.
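
If you want to see the additive structure without digging into the stored trees at all, a safer route is to let predict do the bookkeeping. Predictions on the link (log-odds) scale with n.trees = k are initF plus the accumulated contributions of the first k trees, so differencing consecutive predictions isolates one tree's contribution. A sketch with the model from the question:

f1 <- predict(m, ptitanic, n.trees = 1)   # initF + contribution of tree 1 (log-odds scale)
f2 <- predict(m, ptitanic, n.trees = 2)   # initF + contributions of trees 1 and 2
head(f1 - m$initF)                        # per-observation contribution of tree 1 alone
head(f2 - f1)                             # per-observation contribution of tree 2 alone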
