Random Forests – What Measure of Training Error to Report for Random Forests?

classification, machine learning, overfitting, r, random forest

I'm currently fitting random forests for a classification problem using the randomForest package in R, and am unsure about how to report training error for these models.

My training error is close to 0% when I compute it using predictions that I get with the command:

predict(model, newdata = X_train)

where X_train is the training data.
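For reference, here is roughly how that near-0% figure is computed; a minimal sketch, with y_train standing in for the vector of training labels (a name I use only for illustration here):

library(randomForest)
model <- randomForest(x = X_train, y = y_train)   # default settings: fully grown trees
pred_train <- predict(model, newdata = X_train)   # predict on the same data the forest was fit to
mean(pred_train != y_train)                       # resubstitution ("training") error, typically near 0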

In an answer to a related question, I read that one should use the out-of-bag (OOB) training error as the training error metric for random forests. This quantity is computed from predictions obtained with the command:

predict(model)

In this case, the OOB error is much closer to my mean 10-fold cross-validation (10-CV) test error, which is 11%.
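For comparison, the OOB estimate can also be read off the fitted model directly; a minimal sketch using the same objects as above:

pred_oob <- predict(model)                 # omitting newdata returns out-of-bag predictions
mean(pred_oob != y_train)                  # OOB error estimate
model$err.rate[model$ntree, "OOB"]         # the same estimate as stored by randomForest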

I am wondering:

  1. Is it generally accepted to report OOB training error as the training error measure for random forests?

  2. Is it true that the traditional measure of training error is artificially low?

  3. If the traditional measure of training error is artificially low, then what two measures can I compare to check if the RF is overfitting?

Best Answer

To add to @Soren H. Welling's answer.

1. Is it generally accepted to report OOB training error as the training error measure for random forests?

No. The OOB error of the trained model is not the same as the training error. It can, however, serve as an estimate of the model's predictive accuracy on unseen data.

2. Is it true that the traditional measure of training error is artificially low?

This is true if we are running a classification problem using default settings. The exact process is described in a forum post by Andy Liaw, who maintains the randomForest package in R, as follows:

For the most part, performance on the training set is meaningless. (That's the case for most algorithms, but especially so for RF.) In the default (and recommended) setting, the trees are grown to the maximum size, which means that quite likely there's only one data point in most terminal nodes, and the prediction at a terminal node is determined by the majority class in the node, or by the lone data point. Suppose that is the case all the time; i.e., in all trees all terminal nodes have only one data point. A particular data point would be "in-bag" in about 64% of the trees in the forest, and every one of those trees has the correct prediction for that data point. Even if all the trees where that data point is out-of-bag gave the wrong prediction, by majority vote of all trees you still get the right answer in the end. Thus, basically, the perfect prediction on the training set for RF is "by design".
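The 64% figure is the usual bootstrap in-bag probability: with sampling with replacement, the chance that a given point appears in a particular tree's bootstrap sample is $1 - (1 - 1/N)^N \approx 1 - e^{-1} \approx 0.632$. A quick check in R:

N <- 1000
1 - (1 - 1/N)^N    # probability a given point is in-bag for one tree
1 - exp(-1)        # limiting value, approximately 0.632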

To avoid this behavior, one can set nodesize > 1 (so that the trees are not grown to their maximum size) and/or set sampsize < 0.5N (so that fewer than 50% of the trees are likely to contain a given point $(x_i, y_i)$).
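A minimal sketch of those two adjustments, reusing the hypothetical X_train and y_train from above (the particular values of nodesize and sampsize are illustrative, not recommendations):

library(randomForest)
n <- nrow(X_train)
model_reg <- randomForest(
    x = X_train, y = y_train,
    nodesize = 5,                 # terminal nodes must contain at least 5 observations
    sampsize = floor(0.4 * n),    # each tree is trained on fewer than half of the points
    replace = FALSE               # sample without replacement, so the in-bag fraction is exactly sampsize/n
)
mean(predict(model_reg, newdata = X_train) != y_train)   # training error is no longer forced towards 0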

3. If the traditional measure of training error is artificially low, then what two measures can I compare to check if the RF is overfitting?

If we run RF with nodesize = 1 and sampsize > 0.5N, then the training error of the RF will always be near 0. In this case, the only way to tell whether the model is overfitting is to keep some data aside as an independent validation set. We can then compare the 10-CV test error (or the OOB error estimate) to the error on the independent validation set. If the 10-CV test error is much lower than the error on the independent validation set, then the model may be overfitting.
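As a minimal sketch of that comparison, assuming hypothetical held-out objects X_valid and y_valid alongside the X_train and y_train used above:

library(randomForest)
model <- randomForest(x = X_train, y = y_train)

oob_error   <- mean(predict(model) != y_train)                     # OOB estimate from the training data
valid_error <- mean(predict(model, newdata = X_valid) != y_valid)  # error on the independent validation set

c(oob = oob_error, validation = valid_error)   # a validation error well above the OOB/10-CV estimate suggests overfitting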