Solved – GBM Bootstrap Prediction Interval Code Error

boosting, bootstrap, caret, prediction interval

Based on the code presented in the thread:
How to find a GBM Prediction Interval

I am trying to apply this to my dataset. Below is my full code, and I am having issues with the bootstrap function.

library(caret)
require(foreign)

set.seed(825)
Ridership <- read.spss("V:/Metro/Coverage/ROUTE_MODEL2.sav",use.value.labels=TRUE, to.data.frame = TRUE)

set.seed(825)
fitControl <- trainControl(method = "cv", number = 2)
gbmGrid <-  expand.grid(interaction.depth = (20:21), n.trees = (750), shrinkage = c(0.07))
x <- Ridership[, -148]
y <- Ridership[, 148]

gbmFit <- train(x=x,y=y,"gbm", tuneGrid = gbmGrid, n.minobsinnode = 2, trControl =fitControl, verbose=FALSE)
gbmFit
x.pt <- quantile(Ridership$TOT_RIDERSHIP, c(0.25, 0.5, 0.75))
p <- plot(gbmFit, newdata = Ridership[, -148], grid.levels = x.pt, return.grid = TRUE)
p
library(boot)
bootfun <- function(data, indices) {
  data <- data[indices,]
  x <- Ridership[, -148]
  y <- Ridership[, 148]
  gbmFit <- train(x=x,y=y,"gbm", tuneGrid = gbmGrid, n.minobsinnode = 2, trControl =fitControl, verbose=FALSE)
  plot(gbmFit, newdata = Ridership[, -148], grid.levels = x.pt, return.grid = TRUE)$y
}
b <- boot(data = Ridership, statistic = bootfun, R = 5) 
lims <- t(apply(b$t, 2, FUN = function(x) quantile(x, c(0.025, 0.975))))

When I run the code, lims only shows 1, and nothing more. I am not exactly sure what to define in the bootstrap function. I have flipped through the bootstrap package code, but it is still not clear to me what I am doing wrong. Thanks in advance!

Best Answer

It is hard to say without a reproducible example, but here are some pointers based on what I can understand from the code:

  • For plot.gbm you need to pass a gbm object as well as the variable to plot. In your case that means something like p <- plot(gbmFit$finalModel, i.var = ..., grid.levels = x.pt, ...), where i.var is the variable you want partial dependencies for. The number of evaluation points for a continuous variable defaults to 100 (the continuous.resolution argument, see ?plot.gbm), and I think that is why you get 100 points. The presence of grid.levels should override this if plot.gbm is called correctly.
  • 2-fold CV is not likely to give a good estimate of performance. With 75 data points I would use the bootstrap (or maybe leave-one-out CV) and restrict the trees to a depth of 1 to 3 (or so), unless you have very strong prior knowledge that you require trees of depth 20 or 21.
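
On the question of what to define in the bootstrap function: boot() calls the statistic with the data and a vector of resampled row indices, and expects a numeric vector of the same length back on every call. In the posted bootfun the subset data[indices,] is computed but the model is then refit on the full Ridership data, so every replicate uses the same rows. Below is a minimal sketch of the contract. It is not the GBM fit itself: the toy data frame and the lm() call are hypothetical stand-ins for the ridership data and the train(..., "gbm") call, chosen only so that the example runs with base R and boot.

```r
library(boot)

set.seed(825)
# Hypothetical toy data standing in for the ridership data set (75 rows)
toy <- data.frame(x = runif(75, 0, 10))
toy$y <- 2 + 0.5 * toy$x + rnorm(75, sd = 1)

# Fixed points at which every bootstrap replicate is evaluated
x.pt <- data.frame(x = quantile(toy$x, c(0.25, 0.5, 0.75)))

bootfun <- function(data, indices) {
  d <- data[indices, ]                      # use the resampled rows, not the original data
  fit <- lm(y ~ x, data = d)                # stand-in for the model fit (assumption)
  as.numeric(predict(fit, newdata = x.pt))  # numeric vector, same length on every call
}

b <- boot(data = toy, statistic = bootfun, R = 200)

# b$t has one row per replicate and one column per evaluation point;
# lims gets one row per point with the 2.5% and 97.5% percentiles
lims <- t(apply(b$t, 2, quantile, probs = c(0.025, 0.975)))
lims
```

With a GBM in place of lm(), the refit inside bootfun would use d rather than Ridership, and the returned vector would come from the model evaluated at the fixed points. Refitting a tuned GBM several hundred times can be slow, so R may need to be chosen with that in mind.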

The book Applied Predictive Modeling by Max Kuhn and Kjell Johnson is a great go-to source for tuning gbm ensembles (and tuning predictive models in general).

Also note that this code gives confidence intervals for the predicted values rather than prediction intervals, as pointed out in the comments in the above-mentioned thread.
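
To illustrate the difference, here is a rough sketch under stated assumptions. It is not the method from the linked thread: the toy data and the lm() fit are hypothetical stand-ins, and the prediction interval uses a simple heuristic (add one resampled residual to each bootstrap prediction) that assumes roughly constant error variance. The percentiles of the refitted predictions alone describe uncertainty in the fitted mean, while the percentiles of prediction plus noise describe where a new observation might fall.

```r
set.seed(825)
# Hypothetical toy data; lm() stands in for the GBM fit (assumption)
toy <- data.frame(x = runif(75, 0, 10))
toy$y <- 2 + 0.5 * toy$x + rnorm(75, sd = 1)

x0 <- data.frame(x = 5)   # point at which the intervals are wanted
R <- 500                  # number of bootstrap replicates

sims <- t(replicate(R, {
  d <- toy[sample(nrow(toy), replace = TRUE), ]   # resample rows with replacement
  fit <- lm(y ~ x, data = d)
  mu <- as.numeric(predict(fit, newdata = x0))    # bootstrap estimate of the mean at x0
  noise <- sample(as.numeric(residuals(fit)), 1)  # one resampled residual
  c(mean = mu, new_obs = mu + noise)
}))

conf_int <- quantile(sims[, "mean"], c(0.025, 0.975))     # interval for the fitted mean
pred_int <- quantile(sims[, "new_obs"], c(0.025, 0.975))  # rough interval for a new observation

conf_int
pred_int
```

The second interval is wider because it includes the residual scatter around the fitted mean, not just the variability of the fit from one resample to the next.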

Hope this helps!