Both of the previous answers are wrong. Package gbm uses the interaction.depth parameter as the number of splits it performs on a tree (starting from a single node). Since each split increases the total number of nodes by 3 and the number of terminal nodes by 2 (node $\to$ {left node, right node, NA node}), a tree built with $N$ splits has $3N+1$ nodes in total and $2N+1$ terminal nodes. This can be verified by looking at the output of the pretty.gbm.tree function.
The behaviour is rather misleading, since the user naturally expects interaction.depth to be the depth of the resulting tree. It is not.
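As a quick sketch of that check (assuming the gbm package is installed; the dataset and hyperparameters here are arbitrary), you can count the rows of pretty.gbm.tree, which prints one row per node:

```r
library(gbm)

# Fit a small model; interaction.depth = 3 requests 3 splits per tree.
data(OrchardSprays)
set.seed(1)
model <- gbm(decrease ~ rowpos + colpos + treatment, data = OrchardSprays,
             distribution = "gaussian", n.trees = 10, interaction.depth = 3,
             n.minobsinnode = 5, shrinkage = 0.1)

# One row per node; with N splits the tree has 3*N + 1 nodes in total,
# of which 2*N + 1 are terminal (SplitVar == -1 marks terminal nodes).
tree_df <- pretty.gbm.tree(model, i.tree = 1)
nrow(tree_df)
sum(tree_df$SplitVar == -1)
```

If gbm ends up performing fewer splits than requested (for example, because of n.minobsinnode), the counts will still follow the $3n+1$ / $2n+1$ pattern for the actual number of splits $n$.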
Factors, as always. It seems the model is not using the actual value of the factor, but rather something like its position among the factor levels.
I was able to reproduce your error with the OrchardSprays data:
library(gbm)
data(OrchardSprays)
model <- gbm(decrease ~ rowpos+colpos+treatment, data=OrchardSprays, n.trees=1000, distribution="gaussian", interaction.depth=3, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
firstrow <- OrchardSprays[1,]
str(firstrow)
manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment="D")
str(manualFirstrow)
predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
output:
> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 20.67818
> predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
[1] 20.67818
since "A" has position 1 in the levels of OrchardSprays$treatment, while the factor built on the fly has "D" as its only (and therefore first) level. Adding the levels to the data declaration does the trick:
manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment=factor("D",levels(OrchardSprays$treatment)))
str(manualFirstrow)
predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
output:
> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 50.31276
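The underlying pitfall can be seen without gbm at all: a factor built on the fly from a single string has only one level, so its internal integer code is 1, regardless of where that value sits among the training data's levels. A minimal base-R illustration:

```r
data(OrchardSprays)

# Factor built on the fly: "D" is its only level, so its code is 1.
ad_hoc <- factor("D")
as.integer(ad_hoc)          # 1

# Factor built with the training levels: "D" keeps its true position.
with_levels <- factor("D", levels = levels(OrchardSprays$treatment))
as.integer(with_levels)     # 4, since the levels run A..H
```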
Best Answer
I don't think you're going to find what you're looking for. First of all, there is no real concept of "OOB predictions" for a full gbm fit. It does save the OOB decrease (or increase) in error after each tree, but that does not equate to an OOB prediction. Since the trees are built in sequence (boosted) rather than in parallel (bagged), there is no way to get "untainted" predictions for the training data.

It sounds like you are actually looking for out-of-fold (OOF) predictions. Calling gbm with cross-validation enabled will make k+1 fits, but I don't think it saves anything other than the mean cross-validation error metric at each iteration. I've moved away from using the internal cross-validation functionality for this reason. I fold it (or bag it) myself and save the predictions from the folds.
And yes, these OOF predictions are valuable if you want to see untainted predictions of the training data or if you want to ensemble a gbm with another algorithm.
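A minimal sketch of the manual folding described above (assuming the gbm package is installed; the fold count, formula, and hyperparameters are arbitrary choices for illustration):

```r
library(gbm)

data(OrchardSprays)
set.seed(42)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(OrchardSprays)))
oof <- rep(NA_real_, nrow(OrchardSprays))

for (i in 1:k) {
  train <- OrchardSprays[fold != i, ]
  holdout <- OrchardSprays[fold == i, ]
  fit <- gbm(decrease ~ rowpos + colpos + treatment, data = train,
             distribution = "gaussian", n.trees = 200,
             interaction.depth = 3, shrinkage = 0.1)
  # Each row is predicted by a model that never saw it: an OOF prediction.
  oof[fold == i] <- predict(fit, newdata = holdout, n.trees = 200)
}

# oof now holds untainted predictions for every training row,
# ready for evaluation or for stacking with another algorithm.
```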