Machine Learning – Understanding Out of Bag Error in Random Forest and Data Partitioning

cross-validation, machine-learning, random-forest

I have a question concerning OOB error in random forests and data partitioning. As far as I know, the trees in a random forest are not pruned, and we use the OOB error to measure the performance of the forest. Why, then, should we use data partitioning (training – validation) when constructing a random forest? In many cases I have seen, a data partitioning process is used. In this case, how should the validation error be interpreted?

Thanks in advance,

Andreas

Best Answer

Training a model, tuning its hyperparameters, and evaluating its performance are typically done using independent training, validation, and test sets. This three-way split can take the form of holdout or nested cross validation. The independence of these sets is important because, otherwise, estimates of the error would be downwardly biased: we would select poor models and expect them to perform better on future data than they really would. Because random forests already use bootstrapping to fit the individual trees, they readily yield the out-of-bag (OOB) error. This is an unbiased estimate of the error on future data. As such, it can take the place of the validation or test error, and is cheaper to compute than nested cross validation.
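For instance, in scikit-learn the OOB estimate comes essentially for free once the forest is fitted. Here is a minimal sketch; the dataset and parameter values are just placeholders:

```python
# Setting oob_score=True makes the forest score each training sample using only
# the trees that did not see it in their bootstrap sample, so we get an error
# estimate without holding out a separate set.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)

# oob_score_ is the OOB accuracy; its complement is the OOB error estimate.
oob_error = 1.0 - forest.oob_score_
print(f"OOB error estimate: {oob_error:.3f}")
```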

If we had a fixed set of hyperparameters, we could train a random forest on the entire dataset, estimate performance using the OOB error, and call it a day. But random forests have hyperparameters that may need to be tuned to balance between under- and overfitting. One of these is the number of features considered for each split. Another is tree size, which is typically controlled by limiting the depth or number of nodes when growing the tree, rather than by pruning after the fact. Rather than splitting the data into training, validation, and test sets, we can use the OOB error in place of the validation or test set error. For example, hyperparameters could be tuned to minimize the OOB error and performance could be evaluated on the test set (possibly using cross validation, with no need for nesting), as sketched below.
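A sketch of that workflow, with the dataset, candidate grid, and parameter values purely illustrative: the number of features per split (max_features) is tuned by minimizing the OOB error on the training data, and the test set is touched only once for the final performance estimate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

best_error, best_m = None, None
for m in [2, 4, 8, 16]:  # candidate values for max_features
    forest = RandomForestClassifier(
        n_estimators=500, max_features=m, oob_score=True, random_state=0
    )
    forest.fit(X_train, y_train)
    oob_error = 1.0 - forest.oob_score_  # OOB error plays the role of the validation error
    if best_error is None or oob_error < best_error:
        best_error, best_m = oob_error, m

# Refit with the selected value and report performance on the held-out test set once.
final = RandomForestClassifier(
    n_estimators=500, max_features=best_m, random_state=0
).fit(X_train, y_train)
print(
    f"best max_features: {best_m}, OOB error: {best_error:.3f}, "
    f"test accuracy: {final.score(X_test, y_test):.3f}"
)
```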