Solved – OOB vs CV for Random Forest

bootstrap, cross-validation, out-of-sample, random forest

I know this question has been asked dozens of times, but I want to really clarify what is going on when finding the best forest using OOB Error versus CV with Accuracy. From my understanding, for a Random Forest with $n$ observations, $p$ variables, and $t$ trees, the following procedures occur:

Compute RF with OOB Error:

  1. For the first tree, $t_1$, and utilizing $\sqrt{p}$ variables, the
    RF will bootstrap $q$ samples from $n$, the In-Bag Data Set, and the rest
    will be saved as the Out-of-Bag Data Set.
  2. Then the RF will train $t_1$ on the $q$ samples and test with the Out-of-Bag Data Set.
  3. This test produces an OOB Error for $t_1$, $OOB_{t_1}$.
  4. For the next tree, $t_2$, the RF bootstraps samples again and tests again to get $OOB_{t_2}$.
  5. Rinse and repeat 1-4 until we have all $OOB_{t_i}$.
  6. If required to tune the number of variables, utilize a different number and repeat steps 1-5. Then compare $OOB_{t_i,p}$ between trees and variables (see the sketch after this list).
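
For concreteness, here is a minimal sketch of this OOB-based tuning in scikit-learn (my illustration, not part of the question); the dataset and the candidate `max_features` values are arbitrary. Note that `RandomForestClassifier` reports a single `oob_score_` aggregated over the whole forest rather than a per-tree $OOB_{t_i}$:

```python
# Sketch: tune the number of variables per split (max_features) by OOB error.
# Dataset and candidate values are made up for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

oob_errors = {}
for m in [2, 4, 8, 16]:              # candidate numbers of variables per split
    rf = RandomForestClassifier(
        n_estimators=500,            # t trees
        max_features=m,
        bootstrap=True,
        oob_score=True,              # keep out-of-bag predictions
        random_state=0,
    )
    rf.fit(X, y)
    oob_errors[m] = 1.0 - rf.oob_score_   # OOB error = 1 - OOB accuracy

best_m = min(oob_errors, key=oob_errors.get)
print(oob_errors, "-> best max_features:", best_m)
```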

Compute RF with CV and Accuracy:

  1. Split data into $k$ folds.
  2. For the first tree, $t_1$, and utilizing $\sqrt{p}$ variables, the
    RF will bootstrap $q$ samples from the $k-1$ training folds.
  3. Then the RF will train $t_1$ on the $q$ samples and test on the held-out $k^{th}$ fold, recording the Accuracy.
  4. Rinse and repeat 2-3 until we have all $Accuracy_{t_i}$ for that split of the folds.
  5. Repeat steps 2-4 $k$ times, each time holding out a different fold. Then return the Accuracy for each tree as the average Accuracy across all the folds.
  6. If required to tune the number of variables, repeat steps 1-5 with a different number of variables, until all required combinations of variables and trees have been evaluated (see the sketch after this list).
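
And a sketch of the CV counterpart, again assuming scikit-learn (illustrative only): `GridSearchCV` does the fold splitting, refits the forest for each fold and each candidate number of variables, and reports the mean Accuracy across the $k$ folds.

```python
# Sketch: tune max_features by k-fold CV accuracy; each candidate forest is
# refit k times, once per held-out fold.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=500, random_state=0),
    param_grid={"max_features": [2, 4, 8, 16]},
    scoring="accuracy",
    cv=5,                            # k = 5 folds
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # mean accuracy across the folds
```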

My understanding is that the two procedures are very similar, except that the OOB method utilizes a smaller training/learning set. It seems like best practice is to use OOB, but I feel more comfortable with CV and Accuracy. This is because, by leaving a fold out for each tree (or forest of trees) instead of bootstrapping each tree from the same original $n$ samples, we validate each tree/forest on data it has never seen, rather than on data that other trees may have seen due to bootstrapping. With CV a single tree still does not use all of the training data because of bootstrapping, but as the number of trees increases this effect should, in theory, diminish; with OOB, on the other hand, every observation is always used somewhere. Would this be the proper way to think about this problem?

Best Answer

If you fit a random forest composed of $B$ trees to your dataset, the OOB error on sample $z_j$ is computed using only those trees that did not have $z_j$ included in their bootstrapped training set. So for a given loss function $L$ the OOBE is actually:

$\epsilon^{OOB} = \frac{1}{N^{OOB}} \sum_{j=1}^{N} L(y_j, \frac{1}{|B_j|}\sum_{b \in B_j}{T(x_j | \theta_b, Z_b)} )$.

where $T(\cdot \mid \theta_b, Z_b)$ is simply your $b$-th grown tree with parameters $\theta_b$ trained on the bootstrapped training set $Z_b$, and $B_j = \{b = 1, 2, \dots, B \mid z_j \notin Z_b\}$ is the set of all the trees that did not contain $z_j$ in their training set. Finally, $N^{OOB}$ is the number of samples that are left out of at least one of the bootstrapped training sets $Z_b$, $b = 1, 2, \dots, B$. Meanwhile, for $k$-fold CV:

$\epsilon^{CV} = \frac{1}{N} \sum_{j=1}^{N} L(y_j, \frac{1}{B}\sum_{b=1}^B{T(x_j | \theta_b^{-k(j)}, Z_b^{-k(j)})} )$

where $k(j)$ is the fold that contains sample $z_j$ (so the superscript $-k(j)$ means that fold is held out), $\theta_b^{-k(j)}$ are the parameters of the $b$-th tree at the $k(j)$-th round of CV, and $Z_b^{-k(j)}$ is a bootstrap sample of the training set with the $k(j)$-th fold excluded.
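
To make the first definition concrete, here is a minimal from-scratch sketch (mine, not part of the original answer) of $\epsilon^{OOB}$ with a 0-1 loss: each tree is grown on a bootstrap sample $Z_b$, and the prediction for $z_j$ is aggregated (by majority vote) only over the trees in $B_j$. The dataset, $B$, and tree settings are arbitrary, with scikit-learn's `DecisionTreeClassifier` standing in for the individual trees.

```python
# Sketch of the OOB error formula: bag decision trees by hand and score each
# sample only with the trees that did not see it in their bootstrap (its B_j).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
N, B = len(y), 200

votes = np.zeros((N, 2), dtype=int)        # OOB class votes per sample (binary y)
for b in range(B):
    in_bag = rng.integers(0, N, size=N)            # bootstrap sample Z_b
    oob = np.setdiff1d(np.arange(N), in_bag)       # samples for which b is in B_j
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X[in_bag], y[in_bag])
    votes[oob, tree.predict(X[oob])] += 1          # vote only where z_j is out of bag

covered = votes.sum(axis=1) > 0            # the N^{OOB} samples out of bag at least once
oob_pred = votes[covered].argmax(axis=1)   # aggregate over B_j by majority vote
eps_oob = np.mean(oob_pred != y[covered])  # average 0-1 loss
print("OOB error estimate:", eps_oob)
```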

So while in CV the ensemble of all $B$ trees is tested against the hold-out fold, with OOB an ensemble of only the $|B_j|$ trees in $B_j$ is considered for each sample $z_j$. Even though each OOB sample is evaluated on a smaller number of trees, $|B_j| \le B$, the training data this sub-ensemble collectively sees is actually more than what all the $B$ trees collectively see in CV, since in CV every tree is trained on a bootstrapped dataset drawn from only $k-1$ folds of the original training set. In mathematical form:

$|\bigcup_{b=1}^{B} Z_b^{-k(j)}|<|\bigcup_{b \in B_j} Z_b|$
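
As a quick sanity check of this inequality (my own toy simulation, not from the answer), one can draw the bootstrap indices directly and compare the sizes of the two unions for a single sample $z_j$; no trees need to be fit. $N$, $B$, and $k$ below are arbitrary.

```python
# Sketch: compare the union of CV bootstrap samples (drawn from k-1 folds) with
# the union of the OOB trees' bootstrap samples (drawn from all N, z_j excluded).
import numpy as np

rng = np.random.default_rng(0)
N, B, k = 300, 500, 5
j = 0                                            # the sample z_j under study
fold = rng.integers(0, k, size=N)                # fold labels; k(j) = fold[j]
train_idx = np.flatnonzero(fold != fold[j])      # the k-1 folds of round k(j)

union_cv, union_oob = set(), set()
for b in range(B):
    union_cv.update(rng.choice(train_idx, size=len(train_idx)))  # Z_b^{-k(j)}
    Z_b = rng.integers(0, N, size=N)                             # Z_b over all N samples
    if j not in Z_b:                             # keep only trees with b in B_j
        union_oob.update(Z_b)

print(len(union_cv), "<", len(union_oob))        # the CV union is the smaller one
```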

This inequality implies that the OOBE estimate of the error should be less biased with respect to CV; however, the fact that we are bagging fewer trees, which collectively see overlapping training sets, may cause the variance to be larger. It can actually be proved that for $B \to \infty$ the OOBE converges to the $k$-fold CV estimate with $k=n$ [1], which is an unbiased estimator of the error, but usually with higher variance than CV with $k<n$.
The advantage of OOBE is computational: you basically train and get a hold-out error estimate (almost equivalent to $n$-fold CV) at the same time, avoiding all the burden of re-training your random forest for every CV fold.
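
As a rough illustration of that computational point (again my own sketch, with arbitrary data): scikit-learn gives you the OOB estimate from a single fit, whereas the CV estimate refits the same forest once per fold.

```python
# Sketch: one fit yields the OOB accuracy; 10-fold CV refits the forest 10 times.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)                                    # single fit -> OOB estimate
cv_acc = cross_val_score(rf, X, y, cv=10)       # 10 additional fits -> CV estimate
print("OOB accuracy:", rf.oob_score_, "| 10-fold CV accuracy:", cv_acc.mean())
```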

[1] Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Springer.