Solved – Out-of-bag error estimate for boosting

boosting, cross-validation, data mining, machine learning, random forest

In Random Forest, each tree is grown in parallel on a unique bootstrap sample of the data. Because each bootstrap sample is expected to contain about 63% of the unique observations, roughly 37% of the observations are left out and can be used for testing the tree.
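Those 63%/37% figures follow from the bootstrap itself: a draw of $n$ samples with replacement includes any given observation with probability $1-(1-1/n)^n \approx 1-1/e \approx 0.632$. A quick simulation (my sketch, not part of the original post) confirms this:

```python
import random

random.seed(0)
n = 10_000       # number of observations
n_reps = 200     # bootstrap samples to average over

# Fraction of distinct observations appearing in a bootstrap sample,
# averaged over n_reps independent bootstrap draws
frac_unique = sum(
    len(set(random.randrange(n) for _ in range(n))) / n
    for _ in range(n_reps)
) / n_reps

# Theory: 1 - (1 - 1/n)^n  ->  1 - 1/e  ~ 0.632 as n grows
print(round(frac_unique, 3))
```

The simulated fraction lands near 0.632, leaving about 0.368 of the observations out-of-bag for that one tree.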

Now, it seems that in Stochastic Gradient Boosting, there is also an $OOB_{error}$ estimate similar to the one in RF:

If bag.fraction is set to be greater than 0 (0.5 is recommended), gbm
computes an out-of-bag estimate of the improvement in predictive
performance. It evaluates the reduction in deviance on those
observations not used in selecting the next regression tree.

Source: Ridgeway (2007), section 3.3 (page 8).

I have trouble understanding how this works or why it is valid. Say I am adding a tree to the sequence. I grow this tree on a random subsample of the original data set. I could test this single tree on the observations that were not used to grow it. Agreed. BUT, since boosting is sequential, the prediction for those left-out observations actually comes from the entire sequence of trees built so far, and there is a high chance that many of the preceding trees have already seen them. So the model is not really being tested at each round on unseen observations, the way it is with RF, right?
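To make the mechanics concrete, here is my reading of what the quoted passage describes, as a toy sketch (made-up helper names, squared-error loss, depth-1 stumps; this is not gbm's actual code): at each iteration, the deviance of the current ensemble on the out-of-bag half is measured before and after the new tree is added, and the drop is recorded as the OOB improvement.

```python
import random

random.seed(1)

# Toy data: y = x + noise
X = [random.uniform(0, 1) for _ in range(200)]
y = [x + random.gauss(0, 0.1) for x in X]

def stump(Xs, rs):
    """Fit a depth-1 'tree': a single split at the median of Xs,
    predicting the mean residual on each side."""
    split = sorted(Xs)[len(Xs) // 2]
    left = [r for xv, r in zip(Xs, rs) if xv <= split]
    right = [r for xv, r in zip(Xs, rs) if xv > split]
    lmean = sum(left) / len(left) if left else 0.0
    rmean = sum(right) / len(right) if right else 0.0
    return lambda xv: lmean if xv <= split else rmean

F = [0.0] * len(X)                    # current ensemble prediction
shrinkage, bag_fraction = 0.1, 0.5
oob_improve = []

for m in range(50):
    idx = list(range(len(X)))
    random.shuffle(idx)
    cut = int(bag_fraction * len(X))
    in_bag, out_bag = idx[:cut], idx[cut:]

    # Fit the next tree on in-bag residuals only
    tree = stump([X[i] for i in in_bag], [y[i] - F[i] for i in in_bag])

    # "OOB improvement": squared-error deviance on the out-of-bag
    # observations, before vs. after this tree joins the ensemble
    before = sum((y[i] - F[i]) ** 2 for i in out_bag)
    F = [f + shrinkage * tree(xv) for f, xv in zip(F, X)]
    after = sum((y[i] - F[i]) ** 2 for i in out_bag)
    oob_improve.append(before - after)
```

Note that `F` on the out-of-bag points already reflects all earlier trees, many of which did see those points in their own bags, which is exactly the objection raised above.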

So, how come this is called "out-of-bag" error estimate? To me, it does not appear to be "out" of any bag since the observations have already been seen?

Best Answer

Answering only partially (and adding a new question to your question).

The gbm implementation in R (http://www.rdocumentation.org/packages/gbm/functions/gbm) has two parameters that control some form of out-of-bagness.

a) train.fraction defines the proportion of the data used to train all trees, so 1-train.fraction will be truly OOB (out-of-bag) data.

b) bag.fraction defines the proportion of the training data used in the creation of the next tree in the boost. So there may be some data that is never used in the creation of any tree, and that data could truly be used as OOB data (but this is unlikely; see the question below).

Which brings me to a question of my own. Your analysis that 37% of the data is OOB is true for only ONE tree. The chance that any observation is unused by ANY tree is much smaller: $0.37^{ntrees}$ (it has to be OOB for all $ntrees$ trees; my understanding is that each tree draws its own bootstrap). So in Random Forests it should be very unlikely for there to be any OOB data left to test the whole forest. And yet the randomForest implementation in R (based on Breiman's original code) talks a lot about OOB (see, for example, the err.rate and confusion components of the result: http://www.rdocumentation.org/packages/randomForest/functions/randomForest).
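To put numbers on that claim (my addition, not part of the original answer): the probability that one observation stays out of the bag for every one of $ntrees$ independent bootstraps shrinks geometrically, and is negligible for any realistic forest size.

```python
# Probability a given observation is OOB for a single bootstrap of n draws
n = 10_000
p_oob_one_tree = (1 - 1 / n) ** n   # ~ 1/e ~ 0.368

# Probability it is OOB for ALL trees, assuming each tree
# draws its own independent bootstrap
for ntrees in (1, 10, 100, 500):
    print(ntrees, p_oob_one_tree ** ntrees)
```

Already at 100 trees the probability is around $10^{-44}$, so essentially no observation is left out of every tree.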

I don't know how to answer that (and I thank you (+1) for asking the question and making me realize I don't understand this aspect of Random Forests). One possible resolution would be that there is only one bootstrap from which all trees are constructed, but as far as I know that is not the case.