Solved – GBM Bootstrap Prediction Interval Code Error

Tags: boosting, bootstrap, caret, prediction interval

This is based on the code presented in the thread "How to find a GBM Prediction Interval".

I am trying to apply this to my dataset. Below is my full code, and I am having issues with the bootstrap function.

library(caret)
require(foreign)

set.seed(825)
Ridership <- read.spss("V:/Metro/Coverage/ROUTE_MODEL2.sav",use.value.labels=TRUE, to.data.frame = TRUE)

set.seed(825)
fitControl <- trainControl(method = "cv", number = 2)
gbmGrid <-  expand.grid(interaction.depth = (20:21), n.trees = (750), shrinkage = c(0.07))
x <- Ridership[, -148]
y <- Ridership[, 148]

gbmFit <- train(x=x,y=y,"gbm", tuneGrid = gbmGrid, n.minobsinnode = 2, trControl =fitControl, verbose=FALSE)
gbmFit
x.pt <- quantile(Ridership$TOT_RIDERSHIP, c(0.25, 0.5, 0.75))
p <- plot(gbmFit, newdata = Ridership[, -148], grid.levels = x.pt, return.grid = TRUE)
p
library(boot)
bootfun <- function(data, indices) {
  data <- data[indices,]
  x <- Ridership[, -148]
  y <- Ridership[, 148]
  gbmFit <- train(x=x,y=y,"gbm", tuneGrid = gbmGrid, n.minobsinnode = 2, trControl =fitControl, verbose=FALSE)
  plot(gbmFit, newdata = Ridership[, -148], grid.levels = x.pt, return.grid = TRUE)$y
}
b <- boot(data = Ridership, statistic = bootfun, R = 5) 
lims <- t(apply(b$t, 2, FUN = function(x) quantile(x, c(0.025, 0.975))))

When I run the code, lims only shows 1, and nothing more. I am not exactly sure what to define in the bootstrap function. I have looked through the boot package code, but it is still not clear to me what I am doing wrong. Thanks in advance!

Best Answer

It is hard to say without a reproducible example, but here are some pointers based on what I can understand from the code:

1. For plot.gbm you need to pass a gbm object as well as the variable to plot. In your case, something like p <- plot(gbmFit$finalModel, i.var = ..., grid.levels = x.pt, ...), where i.var is the variable you want partial dependencies for. The grid length defaults to 100 (see ?plot.gbm), which I think is why you get 100 points. The presence of grid.levels should override this if plot.gbm is called correctly.
2. 2-fold CV is not likely to give a good estimate of performance. With 75 data points I would use the bootstrap (or maybe leave-one-out CV) and restrict the trees to a depth of 1 to 3 (or so), unless you have very strong prior knowledge that you require trees of depth 20 or 21.
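One more thing that is visible in the question's code: bootfun resamples into data, but then builds x and y from the global Ridership, so the resampled rows never reach the model. Below is a minimal sketch of the shape the statistic function needs. It is not the asker's model: toy is made-up data and lm() is swapped in for train(..., method = "gbm") purely so the example runs with base R and boot. The point is only that the fit must use data[indices, ] and that the function must return one number per evaluation point.

```r
library(boot)

set.seed(825)
# Hypothetical stand-in data: 75 rows, one predictor, one response
toy <- data.frame(x = runif(75, 0, 10))
toy$y <- 2 + 0.5 * toy$x + rnorm(75)

# Fixed points at which every bootstrap replicate is evaluated
x.pt <- quantile(toy$x, c(0.25, 0.5, 0.75))

bootfun <- function(data, indices) {
  d <- data[indices, ]          # use the resampled rows, not the global data frame
  fit <- lm(y ~ x, data = d)    # stand-in for train(..., method = "gbm")
  # Return a plain numeric vector with one value per evaluation point
  unname(predict(fit, newdata = data.frame(x = x.pt)))
}

b <- boot(data = toy, statistic = bootfun, R = 200)

# One row per evaluation point; columns are the 2.5th and 97.5th percentiles
lims <- t(apply(b$t, 2, quantile, probs = c(0.025, 0.975)))
lims
```

With three evaluation points, b$t has three columns and lims has three rows. If lims comes back with a different number of rows than points, the statistic function is probably not returning what you expect.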

The book Applied Predictive Modeling by Max Kuhn and Kjell Johnson is a great go-to source for tuning gbm ensembles (and tuning predictive models in general).

Also note that this code gives confidence intervals for the predicted values rather than prediction intervals, as pointed out in the comments of the thread mentioned above.

Hope this helps!

Related Solutions

Solved – How to find a GBM Prediction Interval

EDIT: As pointed out in the comments below this gives the confidence intervals for predictions and not strictly the prediction intervals. Was a bit trigger happy with my reply and should have given this some extra thought.

Feel free to ignore this answer or try to build on the code to get the prediction intervals.

I have used the simple bootstrap for creating prediction intervals a few times but there may be other (better) ways.

Consider the oil data in the caret package and suppose we want to generate partial dependencies and 95% intervals for the effect of Stearic on Palmitic. Below is just a simple example, but you can play around with it to suit your needs. Make sure the gbm package is updated to allow the grid.levels argument in plot.gbm.

library(caret)
data(oil)
#train the gbm using just the defaults.
tr <- train(Palmitic ~ ., method = "gbm" ,data = fattyAcids, verbose = FALSE)

#Points to be used for prediction. Use the quartiles here just for illustration
x.pt <- quantile(fattyAcids$Stearic, c(0.25, 0.5, 0.75))

#Generate the predictions, or in this case, the partial dependencies at the selected points. Replace plot() with predict() to get predictions
p <- plot(tr$finalModel, "Stearic", grid.levels = x.pt, return.grid = TRUE)

#Bootstrap the process to get prediction intervals
library(boot)

bootfun <- function(data, indices) {
  data <- data[indices,]

  #As before, just the defaults in this example. Palmitic is the first variable, hence data[,1]
  tr <- train(data[,-1], data[,1], method = "gbm", verbose=FALSE)

  # ... other steps, e.g. using the oneSE rule etc ...
  #Return partial dependencies (or predictions)

  plot(tr$finalModel, "Stearic", grid.levels = x.pt, return.grid = TRUE)$y
  #or predict(tr$finalModel, newdata = ...)
}

#Perform the bootstrap, this can be very time consuming. Just 99 replicates here but we usually want to do more, e.g. 500. Consider using the parallel option
b <- boot(data = fattyAcids, statistic = bootfun, R = 99)

#Get the 95% intervals from the boot object as the 2.5th and 97.5th percentiles
lims <- t(apply(b$t, 2, FUN = function(x) quantile(x, c(0.025, 0.975))))

This is one way to do it, which at least tries to account for the uncertainties arising from tuning the gbm. A similar approach has been used in http://onlinelibrary.wiley.com/doi/10.2193/2006-503/abstract

Sometimes the point estimate is outside the interval, but modifying the tuning grid (i.e., increasing the number of trees and/or the depth) usually solves that.
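Regarding the EDIT at the top: one possible way to build on the code towards prediction intervals is to add a resampled residual to each bootstrap prediction, so that the spread reflects observation noise as well as model uncertainty. The sketch below is a heuristic, not an established recipe for gbm. It assumes made-up data (toy), uses lm() as a stand-in for the gbm fit so it runs with base R and boot, and the names bootfun_ci, bootfun_pi and pct are hypothetical.

```r
library(boot)

set.seed(1)
# Hypothetical stand-in data
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 2 + 0.5 * toy$x + rnorm(100)
x.pt <- quantile(toy$x, c(0.25, 0.5, 0.75))
new_x <- data.frame(x = x.pt)

# Confidence interval version: refit on the resample, return the fitted values
bootfun_ci <- function(data, indices) {
  fit <- lm(y ~ x, data = data[indices, ])   # stand-in for the gbm fit
  unname(predict(fit, newdata = new_x))
}

# Prediction interval version: fitted value plus a resampled residual
bootfun_pi <- function(data, indices) {
  fit <- lm(y ~ x, data = data[indices, ])
  oob <- data[-unique(indices), ]            # rows not drawn in this replicate
  # Prefer out-of-bag residuals; fall back to in-sample ones when every row
  # was drawn (this happens on boot's initial call with indices = 1:n)
  res <- if (nrow(oob) > 0) oob$y - predict(fit, newdata = oob) else residuals(fit)
  res <- unname(res)
  pred <- unname(predict(fit, newdata = new_x))
  pred + res[sample.int(length(res), length(pred), replace = TRUE)]
}

# Percentile limits, one row per evaluation point
pct <- function(b) t(apply(b$t, 2, quantile, probs = c(0.025, 0.975)))

ci_lims <- pct(boot(toy, bootfun_ci, R = 500))
pi_lims <- pct(boot(toy, bootfun_pi, R = 500))

cbind(ci_width = ci_lims[, 2] - ci_lims[, 1],
      pi_width = pi_lims[, 2] - pi_lims[, 1])
```

The prediction-interval widths should come out clearly larger than the confidence-interval widths, since they also include the residual noise. This treats the residuals as exchangeable across the range of x, which may not hold for real data.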

Hope this helps!

Solved – Getting prediction intervals from GBM models

Yes. H2O's implementation seems like a more robust, distributed version of the GBM model offered by sklearn. If you look into the documentation provided by sklearn they offer the following option for the loss function:

Quantile ('quantile'): A loss function for quantile regression. Use 0 < alpha < 1 to specify the quantile. This loss function can be used to create prediction intervals

They also offer an example on how quantile regression can be used to create prediction intervals.
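To make the idea behind the quantile loss concrete, here is a small base R sketch of the pinball (quantile) loss. It is not the sklearn or H2O implementation: it only fits a constant, on simulated data, to show that minimising this loss at alpha = 0.05 and alpha = 0.95 recovers the corresponding quantiles, which together form a 90% interval. The names pinball and fit_const are hypothetical. In R, the gbm package offers a comparable option via distribution = list(name = "quantile", alpha = ...); see ?gbm.

```r
# Pinball (quantile) loss of a candidate value q at quantile level alpha
pinball <- function(y, q, alpha) {
  u <- y - q
  mean(ifelse(u >= 0, alpha * u, (alpha - 1) * u))
}

set.seed(42)
y <- rexp(2000)   # simulated, skewed response

# Constant prediction that minimises the pinball loss for a given alpha
fit_const <- function(alpha) {
  optimize(function(q) pinball(y, q, alpha), interval = range(y))$minimum
}

lower <- fit_const(0.05)
upper <- fit_const(0.95)

# Compare with the empirical quantiles; the two rows should nearly agree
rbind(pinball  = c(lower, upper),
      quantile = quantile(y, c(0.05, 0.95)))

# Share of observations inside [lower, upper]; should be near 0.90
mean(y >= lower & y <= upper)
```

A boosted model trained with this loss does the same thing conditionally on the predictors: one model per alpha, with the pair of predictions serving as the interval.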
