Both of the previous answers are wrong. Package gbm uses the interaction.depth parameter as the number of splits it performs on a tree (starting from a single node). Since each split increases the total number of nodes by 3 and the number of terminal nodes by 2 (node $\to$ {left node, right node, NA node}), a tree built with $N$ splits has $3N+1$ nodes in total and $2N+1$ terminal nodes. This can be verified by looking at the output of the pretty.gbm.tree function.
The behaviour is rather misleading, since the user naturally expects interaction.depth to be the depth of the resulting tree. It is not.
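As a quick sketch of that check (assuming the gbm package is installed; the dataset and hyperparameters here are arbitrary), you can count the rows of pretty.gbm.tree, which prints one row per node:

```r
library(gbm)

# Fit a small model; interaction.depth = 3 requests 3 splits per tree.
data(OrchardSprays)
set.seed(1)
model <- gbm(decrease ~ rowpos + colpos + treatment, data = OrchardSprays,
             distribution = "gaussian", n.trees = 10, interaction.depth = 3,
             n.minobsinnode = 5, shrinkage = 0.1)

# One row per node; with N splits the tree has 3*N + 1 nodes in total,
# of which 2*N + 1 are terminal (SplitVar == -1 marks terminal nodes).
tree_df <- pretty.gbm.tree(model, i.tree = 1)
nrow(tree_df)
sum(tree_df$SplitVar == -1)
```

If gbm ends up performing fewer splits than requested (for example, because of n.minobsinnode), the counts will still follow the $3n+1$ / $2n+1$ pattern for the actual number of splits $n$.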
Factors, as always. It seems the model is not using the actual value of the factor, but rather something like its position among the factor levels.
I was able to reproduce your error with the OrchardSprays data:
library(gbm)
data(OrchardSprays)
model <- gbm(decrease ~ rowpos+colpos+treatment, data=OrchardSprays, n.trees=1000, distribution="gaussian", interaction.depth=3, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
firstrow <- OrchardSprays[1,]
str(firstrow)
manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment="D")
str(manualFirstrow)
predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
output:
> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 20.67818
> predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
[1] 20.67818
since "A" has position 1 in the levels of OrchardSprays$treatment, while the factor built on the fly has "D" as its only (and therefore first) level. Adding the levels to the data declaration does the trick:
manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment=factor("D",levels(OrchardSprays$treatment)))
str(manualFirstrow)
predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
output:
> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 50.31276
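The underlying pitfall can be seen without gbm at all: a factor built on the fly from a single string has only one level, so its internal integer code is 1, regardless of where that value sits among the training data's levels. A minimal base-R illustration:

```r
data(OrchardSprays)

# Factor built on the fly: "D" is its only level, so its code is 1.
ad_hoc <- factor("D")
as.integer(ad_hoc)          # 1

# Factor built with the training levels: "D" keeps its true position.
with_levels <- factor("D", levels = levels(OrchardSprays$treatment))
as.integer(with_levels)     # 4, since the levels run A..H
```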
Best Answer
I don't think you're going to find what you're looking for. First of all, there is no real concept of "OOB predictions" for a full gbm fit. It does save the OOB decrease (or increase) in error after each tree, but that does not equate to an OOB prediction. Since the trees are built in sequence (boosted) rather than in parallel (bagged), there is no way to get "untainted" predictions for the training data.

It sounds like you are actually looking for out-of-fold (OOF) predictions. Calling gbm with cross-validation enabled will make k+1 fits, but I don't think it saves anything other than the mean cross-validation error metric at each iteration. I've moved away from using the internal cross-validation functionality for this reason. I fold it (or bag it) myself and save the predictions from the folds.
And yes, these OOF predictions are valuable if you want to see untainted predictions of the training data or if you want to ensemble a gbm with another algorithm.
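A minimal sketch of the manual folding described above (assuming the gbm package is installed; the fold count, formula, and hyperparameters are arbitrary choices for illustration):

```r
library(gbm)

data(OrchardSprays)
set.seed(42)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(OrchardSprays)))
oof <- rep(NA_real_, nrow(OrchardSprays))

for (i in 1:k) {
  train <- OrchardSprays[fold != i, ]
  holdout <- OrchardSprays[fold == i, ]
  fit <- gbm(decrease ~ rowpos + colpos + treatment, data = train,
             distribution = "gaussian", n.trees = 200,
             interaction.depth = 3, shrinkage = 0.1)
  # Each row is predicted by a model that never saw it: an OOF prediction.
  oof[fold == i] <- predict(fit, newdata = holdout, n.trees = 200)
}

# oof now holds untainted predictions for every training row,
# ready for evaluation or for stacking with another algorithm.
```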