Solved – How to interpret prediction output in GBM() in R for classification problem

adaboost, boosting, classification

I created a model using the gbm() function in library(gbm). Within the gbm() function, I set the distribution as "adaboost". I have a binary response [0, 1]. I used the predict.gbm function for prediction, but the output is not [0, 1], but real numbers that are both negative and positive.

If I set type = "response" in predict.gbm, I believe I get the probability P(Y = 1 | X). Is this correct?

If I do not set type = "response", what are those values? How would I manually convert them to probabilities?

https://cran.r-project.org/web/packages/gbm/gbm.pdf

Best Answer

According to gbm's reference manual, with type="response" gbm converts the predictions back to the same scale as the outcome, and currently the only effect of this is to return probabilities for bernoulli and expected counts for poisson; for all other distributions, "response" and "link" are documented as returning the same values. That said, predict.gbm does in fact transform the predictions when the assumed loss function is Adaboost.

Here is a small R example about it:

# Load the packages and the abalone data
library(data.table)
library(gbm)
data(abalone, package = "AppliedPredictiveModeling")
setDT(abalone)

K <- 3000                   # first K rows for training, the rest for testing
train <- abalone[1:K, ]
test  <- abalone[-(1:K), ]

# Same formula and data, two different loss functions
set.seed(3)
gbm_ber <- gbm(as.numeric("M" == Type) ~ Diameter + WholeWeight + ShellWeight + Height,
               data = train, distribution = "bernoulli")
gbm_ada <- gbm(as.numeric("M" == Type) ~ Diameter + WholeWeight + ShellWeight + Height,
               data = train, distribution = "adaboost")

# Densities of the test-set predictions on the link and response scales
par(mfrow = c(2, 2))
plot(density(predict(gbm_ber, newdata = test, n.trees = 100, type = "link")),
     main = "Bernoulli - Link")
plot(density(predict(gbm_ber, newdata = test, n.trees = 100, type = "response")),
     main = "Bernoulli - Response")
plot(density(predict(gbm_ada, newdata = test, n.trees = 100, type = "link")),
     main = "Adaboost - Link")
plot(density(predict(gbm_ada, newdata = test, n.trees = 100, type = "response")),
     main = "Adaboost - Response")

As we can see, the response predictions lie in $[0,1]$ for Adaboost too. Manually converting the link predictions to response predictions requires the inverse logit, $\frac{1}{1+\exp(-x)}$, in the case of Bernoulli, while for Adaboost the conversion is $\frac{1}{1+\exp(-2x)}$. The $2$ comes directly from the inversion of $\frac{1}{2} \ln(\frac{1-\epsilon}{\epsilon})$; see Schapire (2013), "Explaining AdaBoost", for more details. The manual calculations are shown here:

predict(gbm_ada, newdata=abalone[K+c(1:4),], n.trees=100, type="response")
# [1] 0.4851037 0.4913199 0.4791932 0.5135091
predict(gbm_ber, newdata=abalone[K+c(1:4),], n.trees=100, type="response")
# [1] 0.4730339 0.4943825 0.5105203 0.5063024

1/(1+exp(-2*predict(gbm_ada, newdata=abalone[K+c(1:4),], n.trees=100, type="link")))
# [1] 0.4851037 0.4913199 0.4791932 0.5135091
1/(1+exp(-1*predict(gbm_ber, newdata=abalone[K+c(1:4),], n.trees=100, type="link")))
# [1] 0.4730339 0.4943825 0.5105203 0.5063024
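The two conversions can also be wrapped in small helpers. This is only a sketch: the function names are made up for illustration and the scores in f are arbitrary numbers, not output from the models above. It assumes, as the formulas above imply, that the Bernoulli link scale is the log-odds and the Adaboost link scale is half the log-odds.

```r
# Inverse link for distribution = "bernoulli": link scale taken as the log-odds
link_to_prob_bernoulli <- function(f) 1 / (1 + exp(-f))

# Inverse link for distribution = "adaboost": link scale taken as half the log-odds
link_to_prob_adaboost <- function(f) 1 / (1 + exp(-2 * f))

f <- c(-1, 0, 0.5)  # made-up link-scale scores, not model output

link_to_prob_bernoulli(f)
link_to_prob_adaboost(f)

# The Adaboost conversion is the ordinary logistic function applied to 2 * f
all.equal(link_to_prob_adaboost(f), plogis(2 * f))
```

Under these assumptions a link score of 0 maps to a probability of 0.5 for both distributions, so the sign of the link prediction already gives the predicted class at a 0.5 cutoff.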

Solved – Tolerance in boosted regression trees

Purpose of the tolerance threshold

The paper (Elith, Leathwick & Hastie 2008) states (p. 807) that the function gbm.step implements cross-validation to determine the optimal number of trees, as detailed in its Figure 4 (the figure is an image in the PDF and is not reproduced here).

The tolerance threshold helps determine when "the average of the more recent set is higher than the average of the previous set" (step 5).

The source code for gbm.step (line 159) shows that the algorithm will continue to build trees while (delta.deviance > tolerance.test & n.fitted < max.trees).

tolerance.test is the tolerance threshold.
delta.deviance is initialized to 1 (line 150), a value that does not fall below the tolerance threshold, so the loop keeps running; once at least 20 trees have been built, it is recomputed as follows:

(on line 220)

 if (j >= 20) {
   test1 <- mean(cv.loss.values[(j - 9):j])
   test2 <- mean(cv.loss.values[(j - 19):(j - 9)])
   delta.deviance <- test2 - test1
 }

In other words, delta.deviance is the reduction in the mean loss over the most recent 10 iterations compared with the mean over the window before that. As written, the earlier window (j - 19):(j - 9) contains 11 values and shares index j - 9 with the recent window.

It's worth noting the apparent discrepancy between the source code (v 1.1.1), which compares the current-to-10th-previous iterations against the 11th-20th, and step 5 in the figure, which compares the current-to-5th against the 6th-10th. The code is therefore a little more conservative, in that it averages the loss function over a larger window.
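To see how this stopping rule behaves, here is a minimal sketch that applies the same windowed comparison to a made-up loss curve. The curve, the threshold value, and the variable stop.at are assumptions for illustration only. The real gbm.step loop also fits trees and checks max.trees, which is omitted here.

```r
# Hypothetical CV loss curve: drops quickly, then flattens (made-up numbers)
cv.loss.values <- 1 + exp(-(1:100) / 8)

tolerance.test <- 0.001  # assumed threshold, on the same scale as the loss
stop.at <- NA            # first iteration at which the rule would stop

for (j in 20:length(cv.loss.values)) {
  test1 <- mean(cv.loss.values[(j - 9):j])         # most recent 10 values
  test2 <- mean(cv.loss.values[(j - 19):(j - 9)])  # earlier window, as in the quoted code
  delta.deviance <- test2 - test1
  if (delta.deviance <= tolerance.test) {          # improvement too small: stop
    stop.at <- j
    break
  }
}

stop.at
```

Because the simulated loss keeps decreasing, delta.deviance stays positive and shrinks toward zero, and the loop ends at the first iteration where the windowed improvement is no larger than the threshold.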

How is the tolerance threshold calculated?

By default, tolerance.method = "auto" and tolerance = 0.001.

On line 77:

  mean.total.deviance <- total.deviance/n.cases
  tolerance.test <- tolerance
  if (tolerance.method == "auto") {
    tolerance.test <- mean.total.deviance * tolerance
  }

Since there is no corresponding adjustment for tolerance.method == "fixed", the algorithm would use the default or user-provided argument without adjustment.

So you can specify tolerance.test as an absolute deviance (via tolerance.method='fixed') or relative to the mean total deviance (via tolerance.method='auto').
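The difference between the two methods can be sketched with a small helper. resolve_tolerance is a hypothetical function written for this illustration, not part of the package, and the deviance and case counts below are made-up numbers.

```r
# Hypothetical helper mirroring the quoted logic; not part of the package
resolve_tolerance <- function(total.deviance, n.cases,
                              tolerance = 0.001,
                              tolerance.method = c("auto", "fixed")) {
  tolerance.method <- match.arg(tolerance.method)
  mean.total.deviance <- total.deviance / n.cases
  if (tolerance.method == "auto") {
    mean.total.deviance * tolerance  # relative to the mean total deviance
  } else {
    tolerance                        # used as given, as an absolute deviance
  }
}

# Made-up inputs: total deviance of 1300 over 1000 cases
resolve_tolerance(total.deviance = 1300, n.cases = 1000)
resolve_tolerance(total.deviance = 1300, n.cases = 1000, tolerance.method = "fixed")
```

With "auto" the threshold scales with the data (here 1.3 * 0.001), whereas with "fixed" it stays at whatever value was passed as tolerance.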

Solved – How can machine learning models (GBM, NN etc.) be used for survival analysis

For the case of neural networks, this is a promising approach: WTTE-RNN - Less hacky churn prediction.

The essence of this method is to use a Recurrent Neural Network to predict parameters of a Weibull distribution at each time-step and optimize the network using a loss function that takes censoring into account.

The author also released his implementation on Github.
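As a rough illustration of a loss that takes censoring into account, here is a sketch of the right-censored log-likelihood for a continuous Weibull distribution. It is not the author's implementation: the function name and the data are made up, and the shape and scale are fixed numbers here, whereas in the method described above they would be predicted by the network at each time step.

```r
# Right-censored Weibull log-likelihood (illustrative sketch).
# time:  observed time (event time, or censoring time if no event was seen)
# event: 1 if the event was observed, 0 if the observation is censored
censored_weibull_loglik <- function(time, event, shape, scale) {
  log_density  <- dweibull(time, shape = shape, scale = scale, log = TRUE)
  log_survival <- pweibull(time, shape = shape, scale = scale,
                           lower.tail = FALSE, log.p = TRUE)
  # Observed events contribute the density, censored ones the survival probability
  sum(event * log_density + (1 - event) * log_survival)
}

time  <- c(2.0, 5.5, 3.1, 7.0)  # made-up data
event <- c(1, 0, 1, 0)

# Negative log-likelihood: the quantity a model would minimize
-censored_weibull_loglik(time, event, shape = 1.5, scale = 4)
```

A censored observation only tells us the event had not happened by the recorded time, so it enters through the survival function rather than the density.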
