Random Forests – Out of Bag Error Versus Cross-Validation

cross-validation, overfitting, random-forest

I am fairly new to random forests. In the past, I have always compared the accuracy of fit vs test against fit vs train to detect any overfitting. But I just read here that:

"In random forests, there is no need for cross-validation or a
separate test set to get an unbiased estimate of the test set error.
It is estimated internally , during the run…"

The paragraph above can be found under the section The out-of-bag (oob) error estimate. This out-of-bag error concept is completely new to me, and what's a little confusing is that the OOB error in my model is 35% (or 65% accuracy), yet if I apply cross-validation to my data (just a simple holdout method) and compare fit vs test against fit vs train, I get 65% accuracy and 96% accuracy respectively. In my experience this would be considered overfitting, but the OOB error is 35%, just like my fit-vs-test error. Am I overfitting? Should I even be using cross-validation to check for overfitting in random forests?

In short, I am not sure whether I should trust the OOB error as an unbiased estimate of the test set error when my fit-vs-train accuracy indicates that I am overfitting!

Best Answer

  • training error (as in predict(model, data=train)) is typically useless. Unless you do (non-standard) pruning of the trees, it cannot be much above 0 by design of the algorithm: random forest uses bootstrap aggregation of decision trees, which are known to overfit badly. This is like the training error of a 1-nearest-neighbour classifier (see the sketch after this list).

  • However, the algorithm offers a very elegant way of computing the out-of-bag error estimate (which is essentially an out-of-bootstrap estimate of the aggregated model's error). The out-of-bag error is, for each case, the estimated error of aggregating the predictions of the $\approx \frac{1}{e}$ fraction of the trees that were trained without that particular case (see the short derivation after this list).
    The models aggregated for the out-of-bag error will only be independent of the predicted case if there is no dependence between the input data rows, i.e. each row = one independent case: no hierarchical data structure, no clustering, no repeated measurements.

    So the out-of-bag error is not exactly the same as a cross-validation error (fewer trees for aggregating, more copies of training cases), but for practical purposes it is close enough.

  • What would make sense to look at in order to detect overfitting is comparing the out-of-bag error with an external validation. However, unless you know about clustering in your data, a "simple" cross-validation error will be prone to the same optimistic bias as the out-of-bag error: the splitting is done according to very similar principles.
    You'd need to compare out-of-bag or cross-validation error with the error from a well-designed, independent test experiment to detect this.
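
To make the first point concrete, here is a minimal R sketch (using the randomForest package and the built-in iris data purely as a stand-in for the asker's data set; the 70/30 split and variable names are assumptions for illustration) contrasting the near-zero training error with the out-of-bag error and a simple holdout estimate:

    library(randomForest)

    set.seed(42)
    # simple holdout split, as in the question (iris is only a stand-in data set)
    idx   <- sample(nrow(iris), size = floor(0.7 * nrow(iris)))
    train <- iris[idx, ]
    test  <- iris[-idx, ]

    rf <- randomForest(Species ~ ., data = train, ntree = 500)

    # "fit vs train": predicting the data the forest was grown on -> error close to 0
    train_err <- mean(predict(rf, newdata = train) != train$Species)

    # out-of-bag error: each case is predicted only by the trees that never saw it
    oob_err   <- mean(rf$predicted != train$Species)  # same figure that print(rf) reports

    # "fit vs test": error on the held-out split
    test_err  <- mean(predict(rf, newdata = test) != test$Species)

    c(train = train_err, oob = oob_err, test = test_err)
    # expect: train close to 0, while oob and test are similar to each other

The large gap between the first number and the other two is exactly the pattern described in the question (96% fit-vs-train accuracy versus ~65% for both OOB and fit-vs-test) and, per the first bullet, it is expected by construction rather than evidence of overfitting.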
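
As a side note on the $\approx \frac{1}{e}$ fraction in the second bullet: a given case is left out of a single bootstrap sample of size $n$, drawn with replacement, with probability

$$\left(1 - \frac{1}{n}\right)^{n} \;\longrightarrow\; e^{-1} \approx 0.37 \qquad (n \to \infty),$$

so each case is out of bag for roughly 37% of the trees, and only those trees contribute to its out-of-bag prediction.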