Solved – Is cross validation unnecessary for Random Forest?

bagging, cross-validation, random forest

Is it fair to say cross validation (k-fold or otherwise) is unnecessary for random forest? I've read that this is the case because we can look at out-of-bag (OOB) performance metrics, and these are doing the same thing.

Please help me understand this in the context of RF for a classification problem. Thank you!

Best Answer

Yes, out-of-bag performance for a random forest is very similar to cross validation. Essentially it is leave-one-out cross validation in which each surrogate model is a random forest with fewer trees: each row is out of bag for only about a third (1/e ≈ 37 %) of the trees, so its OOB prediction is aggregated over just that subset. If done correctly, you therefore get a slightly pessimistic bias. The exact bias and variance properties will still be somewhat different from externally cross validating your random forest.
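
To make this concrete, here is a minimal sketch comparing the OOB estimate with an external 10-fold cross validation on a toy classification problem. scikit-learn and its `oob_score` option are my assumed implementation choice; the answer itself names no library, and the numbers are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# OOB estimate: each tree is scored on the rows it did not see during training.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy:        {rf.oob_score_:.3f}")

# External 10-fold cross validation of the same model, for comparison.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=500, random_state=0), X, y, cv=10
)
print(f"10-fold CV accuracy: {cv_scores.mean():.3f}")
```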

As with cross validation, the crucial point for correctness (i.e. a slightly pessimistic bias rather than a large optimistic bias) is the implicit assumption that each row of your data is an independent case. If this assumption is not met, the out-of-bag estimate will be overoptimistic (as would a "plain" cross validation be), and in that situation it may be much easier to set up an outer cross validation that splits into independent groups than to make the random forest deal with such dependence structures.
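
For illustration, a sketch of such an outer, group-wise cross validation, again assuming scikit-learn; the `groups` array is a hypothetical label marking which rows belong to the same independent case (e.g. repeated measurements of the same patient).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = rng.integers(0, 2, size=300)
groups = np.repeat(np.arange(100), 3)  # 100 independent cases, 3 dependent rows each

# GroupKFold keeps all rows of a case in the same fold, so the held-out
# fold really is independent of the training data.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, groups=groups, cv=GroupKFold(n_splits=5),
)
print(f"group-wise CV accuracy: {scores.mean():.3f}")
```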


Assuming you have this independence between rows, you can use the random forest's out-of-bag performance estimate just like the corresponding cross validation estimate: either as an estimate of generalization error or for model tuning (the parameters mentioned by @horaceT, or e.g. boosting). If you use it for model tuning then, as always, you need another independent estimate of the final model's generalization error.
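
As a sketch of that tuning workflow (scikit-learn names assumed, not taken from the answer): tune mtry, here `max_features`, on the OOB estimate, then get an independent estimate of the tuned model's error from data that played no part in the tuning.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Select mtry (max_features) by the OOB estimate instead of an inner CV.
best = None
for mtry in (2, 5, 10, 20):
    rf = RandomForestClassifier(
        n_estimators=500, max_features=mtry, oob_score=True, random_state=0
    )
    rf.fit(X_train, y_train)
    if best is None or rf.oob_score_ > best.oob_score_:
        best = rf

# The OOB score was used up by the tuning, so the final error estimate
# must come from independent data.
print(f"selected max_features={best.max_features}, "
      f"independent test accuracy: {best.score(X_test, y_test):.3f}")
```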

That being said, the number of trees and the number of variables are reasonably easy to fix, so random forest is one of the models I consider when the sample size is too small for data-driven model tuning.

  • Prediction error will not increase with a higher number of trees; at some point it just stops decreasing any further. So you can simply throw in a bit more computation time and be fine (see the sketch after this list).

  • The number of variables to consider at each split (mtry) will depend on your data, but IMHO isn't very critical either (in the sense that you can e.g. use experience from previous applications with similar data to fix it).

  • Leaf size (for classification) is again typically left at 1; this costs only computation time and memory, not generalization performance.
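
A small sketch illustrating the first bullet, assuming scikit-learn's `warm_start` option to grow the same forest incrementally: the OOB error stops improving as trees are added, but it does not get worse.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# warm_start=True reuses the already-fitted trees and only adds new ones,
# so we can watch the OOB error as the forest grows.
rf = RandomForestClassifier(oob_score=True, warm_start=True, random_state=0)
for n in (25, 50, 100, 200, 400, 800):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(f"{n:4d} trees -> OOB error {1 - rf.oob_score_:.3f}")
```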
