Solved – Generating predictions on training data in GBM regression

Tags: boosting, machine-learning, r, regression

I am fitting a GBM-based regression model in R with a Gaussian loss function. The problem I face is that, after fitting the model, the predicted values generated on the training dataset show very little variation, i.e. Q1, Q2, and Q3 are almost the same. However, the predictions generated by the same model on the test data seem to be well spread out.

Just to be thorough, I also ran a linear regression and generated predictions on the same training data to check the variability of the predictions. Those predictions are well spread out.

I am not sure if I am generating predictions from gbm correctly.

Here is an example using the mtcars dataset, generating predictions on the training data with both gbm and lm:

library(gbm)
# load mtcars data
data(mtcars)
# fit GBM
gbmFit2 <- gbm(mpg ~ cyl + disp + hp + wt + qsec,
               data = mtcars,
               distribution = "gaussian",
               interaction.depth = 3,
               bag.fraction = 0.7,
               n.trees = 50)
# generate predictions
p1 <- predict(gbmFit2, n.trees = 50)
# summary of actual values
summary(mtcars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.42   19.20   20.09   22.80   33.90 
# summary of predictions from GBM
summary(p1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  19.88   19.88   20.03   20.09   20.33   20.33 
# linear regression
regFit2 <- lm(mpg ~ cyl + wt, data = mtcars)
# summary of predictions from linear regression
summary(predict(regFit2))
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 10.32   16.10   19.66   20.09   25.04   28.83 

Best Answer

The issue is the number of trees you are fitting in relation to your learning rate.

You do not provide a learning rate to your booster (called shrinkage in the R library), so the model assumes the default of $0.001$. This means you are fitting 50 trees, and the contribution of each is shrunk by $0.001$, so you're really only getting about $0.05$ of a tree.
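
For intuition, here is a quick sketch of the same model from the question with the learning rate set explicitly; shrinkage = 0.1 is only an illustrative choice, not a tuned value. With 50 trees that is roughly 5 "effective" trees, so you should already see noticeably more spread in the training predictions.

library(gbm)
data(mtcars)

# same model as in the question, but with the learning rate set explicitly
# shrinkage = 0.1 is an illustrative value only
gbmFit3 <- gbm(mpg ~ cyl + disp + hp + wt + qsec,
               data = mtcars,
               distribution = "gaussian",
               interaction.depth = 3,
               bag.fraction = 0.7,
               shrinkage = 0.1,   # 50 trees * 0.1 ~ 5 "effective" trees
               n.trees = 50)

summary(predict(gbmFit3, n.trees = 50))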

You need to fit many, many more trees when using boosting.

library(gbm)
data(mtcars)

M <- gbm(mpg~cyl+disp+hp+wt+qsec,
         data=mtcars,
         distribution = "gaussian",
         interaction.depth=3,
         bag.fraction=0.7,
         n.trees = 10000)

p <- predict(M, n.trees = 10000)
summary(p)

Results in

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.24   15.19   18.97   20.09   25.93   26.86 

To tune the appropriate number of trees, you should fit many more than you think are necessary, then use cross-validation to find the optimal number to use for predictions. Here's an example you can run using your data.

M <- gbm(mpg~cyl+disp+hp+wt+qsec,
         data=mtcars,
         distribution = "gaussian",
         interaction.depth=2,
         n.minobsinnode = 2,
         bag.fraction=1.0,
         n.trees = 50000,
         cv.folds=3)

gbm.perf(M)
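
gbm.perf returns the estimated optimal number of iterations (here based on the cross-validation error, method = "cv"), so you can capture that value and pass it straight to predict:

best.iter <- gbm.perf(M, method = "cv")   # estimated optimal number of trees
p.cv <- predict(M, n.trees = best.iter)
summary(p.cv)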

Given the very small size of your data set, it would be a good idea to bootstrap this entire cross-validation process many, many times to assess the stability of your decisions.
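
A minimal sketch of that idea, assuming you simply resample the rows of mtcars with replacement and repeat the fit-plus-CV selection each time; the number of repetitions and the reduced n.trees are illustrative values chosen only to keep the run time manageable.

set.seed(1)
n.boot <- 20   # illustrative; use many more repetitions in practice
best.iters <- replicate(n.boot, {
  rows <- sample(nrow(mtcars), replace = TRUE)   # bootstrap resample of the rows
  M.boot <- gbm(mpg ~ cyl + disp + hp + wt + qsec,
                data = mtcars[rows, ],
                distribution = "gaussian",
                interaction.depth = 2,
                n.minobsinnode = 2,
                bag.fraction = 1.0,
                n.trees = 10000,   # reduced from 50000 above to keep this quick
                cv.folds = 3)
  gbm.perf(M.boot, plot.it = FALSE, method = "cv")
})
# how stable is the selected number of trees across resamples?
summary(best.iters)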