Solved – Generating predictions on training data in GBM regression

Tags: boosting, machine-learning, r, regression

I am fitting a GBM-based regression model in R with a Gaussian loss function. The problem I face is that, after fitting the model, the predicted values generated on the training dataset show very little variation, i.e. Q1, Q2, and Q3 are almost the same. However, the predictions generated by the same model on the test data seem to be well spread out.

Just to be thorough, I also ran a linear regression and generated predictions on the same training data to check the variability of the predictions. Those predictions are well spread out.

I am not sure if I am generating predictions from gbm correctly.

Here is an example using the mtcars dataset, generating predictions on the training data with both gbm and lm:

library(gbm)
# load mtcars data
data(mtcars)
# fit GBM
gbmFit2 <- gbm(mpg ~ cyl + disp + hp + wt + qsec,
               data = mtcars,
               distribution = "gaussian",
               interaction.depth = 3,
               bag.fraction = 0.7,
               n.trees = 50)
# generate predictions
p1 <- predict(gbmFit2, n.trees = 50)
# summary of actual values
summary(mtcars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.42   19.20   20.09   22.80   33.90 
# summary of predictions from GBM
summary(p1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  19.88   19.88   20.03   20.09   20.33   20.33 
# linear regression
regFit2 <- lm(mpg ~ cyl + wt, data = mtcars)
# summary of predictions from linear regression
summary(predict(regFit2))
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 10.32   16.10   19.66   20.09   25.04   28.83 

Best Answer

The issue is the number of trees you are fitting in relation to your learning rate.

You do not provide a learning rate to your booster (called shrinkage in the R library), so the model assumes the default of $0.001$. This means you are fitting 50 trees, and the contribution of each is shrunk by $0.001$, so you're really only getting about $0.05$ of a tree.
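
For intuition, here is a quick sketch of the same model from the question with the learning rate set explicitly; shrinkage = 0.1 is only an illustrative choice, not a tuned value. With 50 trees that is roughly 5 "effective" trees, so you should already see noticeably more spread in the training predictions.

library(gbm)
data(mtcars)

# same model as in the question, but with the learning rate set explicitly
# shrinkage = 0.1 is an illustrative value only
gbmFit3 <- gbm(mpg ~ cyl + disp + hp + wt + qsec,
               data = mtcars,
               distribution = "gaussian",
               interaction.depth = 3,
               bag.fraction = 0.7,
               shrinkage = 0.1,   # 50 trees * 0.1 ~ 5 "effective" trees
               n.trees = 50)

summary(predict(gbmFit3, n.trees = 50))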

You need to fit many, many more trees when using boosting.

library(gbm)
data(mtcars)

M <- gbm(mpg~cyl+disp+hp+wt+qsec,
         data=mtcars,
         distribution = "gaussian",
         interaction.depth=3,
         bag.fraction=0.7,
         n.trees = 10000)

p <- predict(M, n.trees = 10000)
summary(p)

Results in

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.24   15.19   18.97   20.09   25.93   26.86 

To tune the appropriate number of trees, you should fit many more than you think are necessary, then use cross-validation to find the optimal number to use for predictions. Here's an example you can run using your data.

M <- gbm(mpg~cyl+disp+hp+wt+qsec,
         data=mtcars,
         distribution = "gaussian",
         interaction.depth=2,
         n.minobsinnode = 2,
         bag.fraction=1.0,
         n.trees = 50000,
         cv.folds=3)

gbm.perf(M)
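
gbm.perf returns the estimated optimal number of iterations (here based on the cross-validation error, method = "cv"), so you can capture that value and pass it straight to predict:

best.iter <- gbm.perf(M, method = "cv")   # estimated optimal number of trees
p.cv <- predict(M, n.trees = best.iter)
summary(p.cv)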

Given the very small size of your data set, it would be a good idea to bootstrap this entire cross-validation process many, many times to assess the stability of your decisions.
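
A minimal sketch of that idea, assuming you simply resample the rows of mtcars with replacement and repeat the fit-plus-CV selection each time; the number of repetitions and the reduced n.trees are illustrative values chosen only to keep the run time manageable.

set.seed(1)
n.boot <- 20   # illustrative; use many more repetitions in practice
best.iters <- replicate(n.boot, {
  rows <- sample(nrow(mtcars), replace = TRUE)   # bootstrap resample of the rows
  M.boot <- gbm(mpg ~ cyl + disp + hp + wt + qsec,
                data = mtcars[rows, ],
                distribution = "gaussian",
                interaction.depth = 2,
                n.minobsinnode = 2,
                bag.fraction = 1.0,
                n.trees = 10000,   # reduced from 50000 above to keep this quick
                cv.folds = 3)
  gbm.perf(M.boot, plot.it = FALSE, method = "cv")
})
# how stable is the selected number of trees across resamples?
summary(best.iters)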