Random Forests – What Measure of Training Error to Report for Random Forests?

classification, machine learning, overfitting, r, random forest

I'm currently fitting random forests for a classification problem using the randomForest package in R, and am unsure about how to report training error for these models.

My training error is close to 0% when I compute it using predictions that I get with the command:

predict(model, newdata = X_train)

where X_train is the training data.
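For reference, here is roughly how that near-0% figure is computed; a minimal sketch, with y_train standing in for the vector of training labels (a name I use only for illustration here):

library(randomForest)
model <- randomForest(x = X_train, y = y_train)   # default settings: fully grown trees
pred_train <- predict(model, newdata = X_train)   # predict on the same data the forest was fit to
mean(pred_train != y_train)                       # resubstitution ("training") error, typically near 0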

In an answer to a related question, I read that one should use the out-of-bag (OOB) training error as the training error metric for random forests. This quantity is computed from predictions obtained with the command:

predict(model)

In this case, the OOB error is much closer to my mean 10-fold cross-validation (10-CV) test error, which is 11%.
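For comparison, the OOB estimate can also be read off the fitted model directly; a minimal sketch using the same objects as above:

pred_oob <- predict(model)                 # omitting newdata returns out-of-bag predictions
mean(pred_oob != y_train)                  # OOB error estimate
model$err.rate[model$ntree, "OOB"]         # the same estimate as stored by randomForest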

I am wondering:

  1. Is it generally accepted to report OOB training error as the training error measure for random forests?

  2. Is it true that the traditional measure of training error is artificially low?

  3. If the traditional measure of training error is artificially low, then what two measures can I compare to check if the RF is overfitting?

Best Answer

To add to @Soren H. Welling's answer.

1. Is it generally accepted to report OOB training error as the training error measure for random forests?

No. The OOB error of the trained model is not the same as the training error. It can, however, serve as an estimate of the model's predictive accuracy on unseen data.

2. Is it true that the traditional measure of training error is artificially low?

This is true if we are running a classification problem using default settings. The exact process is described in a forum post by Andy Liaw, who maintains the randomForest package in R, as follows:

For the most part, performance on the training set is meaningless. (That's the case for most algorithms, but especially so for RF.) In the default (and recommended) setting, the trees are grown to the maximum size, which means that quite likely there's only one data point in most terminal nodes, and the prediction at a terminal node is determined by the majority class in the node, or by the lone data point. Suppose that is the case all the time; i.e., in all trees all terminal nodes have only one data point. A particular data point would be "in-bag" in about 64% of the trees in the forest, and every one of those trees has the correct prediction for that data point. Even if all the trees where that data point is out-of-bag gave the wrong prediction, by majority vote of all trees you still get the right answer in the end. Thus, basically, the perfect prediction on the training set for RF is "by design".
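The 64% figure is the usual bootstrap in-bag probability: with sampling with replacement, the chance that a given point appears in a particular tree's bootstrap sample is $1 - (1 - 1/N)^N \approx 1 - e^{-1} \approx 0.632$. A quick check in R:

N <- 1000
1 - (1 - 1/N)^N    # probability a given point is in-bag for one tree
1 - exp(-1)        # limiting value, approximately 0.632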

To avoid this behavior, one can set nodesize > 1 (so that the trees are not grown to their maximum size) and/or set sampsize < 0.5N (so that fewer than 50% of the trees are likely to contain a given point $(x_i, y_i)$).
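A minimal sketch of those two adjustments, reusing the hypothetical X_train and y_train from above (the particular values of nodesize and sampsize are illustrative, not recommendations):

library(randomForest)
n <- nrow(X_train)
model_reg <- randomForest(
    x = X_train, y = y_train,
    nodesize = 5,                 # terminal nodes must contain at least 5 observations
    sampsize = floor(0.4 * n),    # each tree is trained on fewer than half of the points
    replace = FALSE               # sample without replacement, so the in-bag fraction is exactly sampsize/n
)
mean(predict(model_reg, newdata = X_train) != y_train)   # training error is no longer forced towards 0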

3. If the traditional measure of training error is artificially low, then what two measures can I compare to check if the RF is overfitting?

If we run RF with nodesize = 1 and sampsize > 0.5N, then the training error of the RF will always be near 0. In this case, the only way to tell whether the model is overfitting is to keep some data aside as an independent validation set. We can then compare the 10-CV test error (or the OOB error estimate) to the error on the independent validation set. If the 10-CV test error is much lower than the error on the independent validation set, then the model may be overfitting.
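As a minimal sketch of that comparison, assuming hypothetical held-out objects X_valid and y_valid alongside the X_train and y_train used above:

library(randomForest)
model <- randomForest(x = X_train, y = y_train)

oob_error   <- mean(predict(model) != y_train)                     # OOB estimate from the training data
valid_error <- mean(predict(model, newdata = X_valid) != y_valid)  # error on the independent validation set

c(oob = oob_error, validation = valid_error)   # a validation error well above the OOB/10-CV estimate suggests overfitting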