Random Forest – How to Compute Out-of-Bag Error in Random Forest

Tags: out-of-sample, random-forest

I am implementing a Random Forest classifier as a side project, and I am a bit unclear on the correct way to compute the out-of-bag (OOB) estimate of the classifier's error rate.

My understanding is that typically, for each tree in the forest, one creates a training sample by drawing examples from the original sample with replacement (a bootstrap sample, so some examples are repeated and others omitted), and the omitted examples can be used to compute out-of-bag estimates. The part I am unclear about is how to aggregate the errors across the different out-of-bag samples:

  1. The naive approach would be, for each tree, to count how many of its OOB examples are misclassified, and compute the average misclassification rate over all trees (total misclassified / total examples out of bag).

    • However, it seems to me that in essence this would be computing the average classification error of the individual trees, missing the fact that the forest takes a majority vote over the trees' verdicts, compensating for "weaker" trees.
  2. A more complicated way would be to take each OOB example, look up which trees did not include it in their training sample, and take a majority vote over only those trees.

    • This is computationally more painful, but it seems to properly account for the fact that a forest is more than the sum of its parts (a sketch of this approach follows the list).
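
For concreteness, here is a minimal sketch of the second approach, assuming scikit-learn-style decision trees on synthetic data; the `in_bag` bookkeeping and all other names are illustrative, not taken from any particular library's OOB machinery:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, random_state=0)
n, n_trees = len(y), 50

# Fit each tree on a bootstrap sample and remember which indices it saw.
trees, in_bag = [], []
for _ in range(n_trees):
    idx = rng.integers(0, n, size=n)       # draw n indices with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    in_bag.append(set(idx))

# OOB error via the ensemble vote: classify each example by majority
# vote over only the trees whose bootstrap sample omitted it.
errors = counted = 0
for i in range(n):
    votes = [int(t.predict(X[i : i + 1])[0])
             for t, bag in zip(trees, in_bag) if i not in bag]
    if not votes:                          # example landed in every bag
        continue
    errors += Counter(votes).most_common(1)[0][0] != y[i]
    counted += 1
print("OOB error estimate:", errors / counted)
```

With enough trees, almost every example is out of bag for some of them: a given bootstrap sample omits an example with probability roughly (1 - 1/n)^n ≈ e^(-1) ≈ 0.37, so nearly the full data set contributes to the estimate.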

Can anybody enlighten me as to which of the two approaches is correct, or, if neither is, what I should be doing instead?

Best Answer

Your second approach is the correct one. As you say, it is the estimate that uses the whole ensemble, yet never evaluates an example with any tree that used that example during training.
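
If you want to cross-check your own implementation, scikit-learn's RandomForestClassifier computes a closely related ensemble-based OOB estimate when fitted with oob_score=True (it aggregates the trees' predicted class probabilities rather than hard votes). Note that oob_score_ reports OOB accuracy, so the error is its complement:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             random_state=0).fit(X, y)
print("OOB error:", 1 - clf.oob_score_)    # oob_score_ is OOB accuracy
```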
