Solved – Does modeling with Random Forests require cross-validation?

Tags: cross-validation, out-of-sample, overfitting, random-forest

As far as I've seen, opinions differ on this. Best practice would certainly dictate using cross-validation (especially when comparing RFs with other algorithms on the same dataset). On the other hand, the original source states that the out-of-bag (OOB) error being calculated during model training is a sufficient indicator of test set performance. Even Trevor Hastie, in a relatively recent talk, says that "Random Forests provide free cross-validation". Intuitively, this makes sense to me if one is training and trying to improve an RF-based model on a single dataset.

What's your opinion on this?

Best Answer

The OOB error is calculated, for each observation, using only the trees that did not have this particular observation in their bootstrap sample; see this related question. This is very roughly equivalent to two-fold cross-validation, since the probability of a particular observation appearing in a particular bootstrap sample is $1-(1-\frac{1}{N})^N \approx 1-e^{-1} \approx 0.632$.
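To make the arithmetic concrete, here is a minimal sketch (assuming scikit-learn; the dataset and hyperparameters are arbitrary placeholders) that checks the bootstrap-inclusion probability and reads off the OOB error that comes for free with random forest training:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Probability that a given observation lands in a bootstrap sample of size N:
# 1 - (1 - 1/N)^N -> 1 - 1/e ≈ 0.632 as N grows.
N = 1000
print(1 - (1 - 1 / N) ** N)  # ~0.632

# Toy dataset; in practice, substitute your own data.
X, y = make_classification(n_samples=N, n_features=20, random_state=0)

# oob_score=True scores each observation using only the trees
# whose bootstrap sample did not contain it.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)  # free estimate of test set accuracy
```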

As @Wouter points out, you will probably still want to use cross-validation for parameter tuning, but as an estimate of test set error the OOB error should be fine.
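A sketch of that division of labour, under the same toy setup as above (the grid over max_features is an arbitrary choice for illustration): cross-validation tunes the hyperparameter, and the OOB error of the refit forest then serves as the test set estimate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data again; in practice, substitute your own.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Cross-validation for tuning: search over max_features.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0),
    param_grid={"max_features": ["sqrt", "log2", None]},
    cv=5,
)
search.fit(X, y)

# GridSearchCV refits the best model on all the data, so its OOB
# score is the (nearly free) estimate of test set performance.
print("Best max_features:", search.best_params_)
print("OOB accuracy of tuned model:", search.best_estimator_.oob_score_)
```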