Each tree in the forest is built from a bootstrap sample of the observations in your training data. Those observations in the bootstrap sample build the tree, whilst those not in the bootstrap sample form the out-of-bag (or OOB) samples.

It should be clear that the same variables are available for cases in the data used to build a tree as for the cases in the OOB sample. To get predictions for the OOB sample, each OOB case is passed down the current tree, following the tree's split rules until it arrives in a terminal node. That yields the OOB predictions for that particular tree.

This process is repeated a large number of times, each tree trained on a new bootstrap sample from the training data and predictions for the new OOB samples derived.
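As a minimal sketch of the sampling step described above (the sizes `n` and `n_tree` are made-up values, and this is not how any particular package implements it), the in-bag and OOB rows for each tree can be drawn like this:

```r
set.seed(42)   # arbitrary seed, only for reproducibility
n      <- 10   # hypothetical number of training observations
n_tree <- 5    # hypothetical number of trees

# For each tree, draw n row indices with replacement (the bootstrap sample);
# the rows that are never drawn form that tree's OOB sample.
in_bag <- replicate(n_tree, sample(n, size = n, replace = TRUE),
                    simplify = FALSE)
oob <- lapply(in_bag, function(idx) setdiff(seq_len(n), idx))

# Logical matrix: TRUE where a sample is OOB for a given tree
is_oob <- sapply(oob, function(o) seq_len(n) %in% o)
dimnames(is_oob) <- list(paste0("samp", seq_len(n)),
                         paste0("tree", seq_len(n_tree)))
is_oob
```

Each column of `is_oob` marks the samples that the corresponding tree would be asked to predict.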

As the number of trees grows, any one sample will be in the OOB sample of more than one tree. The OOB prediction for a training sample after trees 1, ..., N is therefore the "average" of the predictions from those of the N trees for which that sample was OOB. By "average" we mean the mean of the predictions for a continuous response, or the majority vote for a categorical response (the majority vote is the class with the most votes over those trees among 1, ..., N for which the sample was OOB).
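For the categorical case, a rough sketch of the majority vote might look like the following. The matrix `oob.class` and the helper `majority_vote()` are hypothetical names invented for illustration, and the tie-breaking rule here (first class in sorted order) is a simplification of my own:

```r
# Hypothetical OOB class predictions: rows are samples, columns are trees.
# NA marks trees for which the sample was in the bootstrap sample.
oob.class <- matrix(c("a", NA,  "b", "a",
                      NA,  "b", "b", NA,
                      "a", "a", NA,  "b"),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(paste0("samp", 1:3),
                                    paste0("tree", 1:4)))

# Most frequent class among the non-NA votes in a row
majority_vote <- function(x) {
  votes <- table(x)                 # table() ignores NA by default
  names(votes)[which.max(votes)]    # ties go to the first class in sorted order
}

apply(oob.class, 1, majority_vote)
```

Here `samp1` and `samp3` each receive two votes for `"a"` and one for `"b"`, and `samp2` receives two votes for `"b"`.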

For example, assume we had the following OOB predictions for 10 samples in the training set from 10 trees:

```
set.seed(123)
# Fake predictions: one row per sample, one column per tree
oob.p <- matrix(rpois(100, lambda = 4), ncol = 10)
colnames(oob.p) <- paste0("tree", seq_len(ncol(oob.p)))
rownames(oob.p) <- paste0("samp", seq_len(nrow(oob.p)))
# Blank out half of the cells to mimic samples that were in-bag for a tree
oob.p[sample(length(oob.p), 50)] <- NA
> oob.p
       tree1 tree2 tree3 tree4 tree5 tree6 tree7 tree8 tree9 tree10
samp1     NA    NA     7     8     2     1    NA     5     3      2
samp2      6    NA     5     7     3    NA    NA    NA    NA     NA
samp3      3    NA     5    NA    NA    NA     3     5    NA     NA
samp4      6    NA    10     6    NA    NA     3    NA     6     NA
samp5     NA     2    NA    NA     2    NA     6     4    NA     NA
samp6     NA     7    NA     4    NA     2     4     2    NA     NA
samp7     NA    NA    NA     5    NA    NA    NA     3     9      5
samp8      7     1     4    NA    NA     5     6    NA     7     NA
samp9      4    NA    NA     3    NA     7     6     3    NA     NA
samp10     4     8     2     2    NA    NA     4    NA    NA      4
```

Where `NA` means the sample was in the training data for that tree (in other words it was not in the OOB sample).

The mean of the non-`NA` values for each row gives the OOB prediction for each sample, for the *entire forest*:

```
> rowMeans(oob.p, na.rm = TRUE)
 samp1  samp2  samp3  samp4  samp5  samp6  samp7  samp8  samp9 samp10
  4.00   5.25   4.00   6.20   3.50   3.80   5.50   5.00   4.60   4.00
```

As each tree is added to the forest, we can compute the OOB error up to and including that tree. For example, below are the cumulative means for each sample:

```
# Cumulative mean of the non-NA predictions in a row; cells where the
# sample was not OOB for that tree stay NA
FUN <- function(x) {
  na <- is.na(x)
  cs <- cumsum(x[!na]) / seq_len(sum(!na))
  x[!na] <- cs
  x
}
> print(t(apply(oob.p, 1, FUN)), digits = 3)
       tree1 tree2 tree3 tree4 tree5 tree6 tree7 tree8 tree9 tree10
samp1     NA    NA  7.00  7.50  5.67  4.50    NA   4.6  4.33    4.0
samp2      6    NA  5.50  6.00  5.25    NA    NA    NA    NA     NA
samp3      3    NA  4.00    NA    NA    NA  3.67   4.0    NA     NA
samp4      6    NA  8.00  7.33    NA    NA  6.25    NA  6.20     NA
samp5     NA     2    NA    NA  2.00    NA  3.33   3.5    NA     NA
samp6     NA     7    NA  5.50    NA  4.33  4.25   3.8    NA     NA
samp7     NA    NA    NA  5.00    NA    NA    NA   4.0  5.67    5.5
samp8      7     4  4.00    NA    NA  4.25  4.60    NA  5.00     NA
samp9      4    NA    NA  3.50    NA  4.67  5.00   4.6    NA     NA
samp10     4     6  4.67  4.00    NA    NA  4.00    NA    NA    4.0
```

In this way we see how the prediction is accumulated over the N trees in the forest up to a given iteration. If you read across the rows, the right-most non-`NA` value is the one I show above for the OOB prediction. That is how traces of OOB performance can be made: an RMSEP can be computed for the OOB samples based on the OOB predictions accumulated over the N trees.
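To make the idea of an OOB performance trace concrete, here is a rough sketch. It rebuilds a fake prediction matrix in the same way as above, and it needs observed responses, so `y` below is a made-up vector purely for illustration. Unlike `FUN`, the hypothetical helper `running_mean()` carries the latest OOB prediction forward across trees where the sample was in-bag, so that an error can be computed after every tree:

```r
set.seed(123)
oob.p <- matrix(rpois(100, lambda = 4), ncol = 10)
oob.p[sample(length(oob.p), 50)] <- NA
y <- rpois(10, lambda = 4)   # hypothetical observed responses

# Running OOB prediction after each tree: mean of the non-NA predictions
# seen so far; NA until the sample has been OOB at least once.
running_mean <- function(x) {
  seen   <- !is.na(x)
  n_seen <- cumsum(seen)
  total  <- cumsum(ifelse(seen, x, 0))
  ifelse(n_seen > 0, total / n_seen, NA)
}
oob.run <- t(apply(oob.p, 1, running_mean))

# RMSEP after each tree, over the samples with an OOB prediction so far
rmsep <- apply(oob.run, 2, function(p) sqrt(mean((p - y)^2, na.rm = TRUE)))
rmsep
```

Plotting `rmsep` against the tree index gives the kind of trace described above; with real predictions the curve would typically settle down as trees are added.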

Note that the R code shown is not taken from the internals of the randomForest code in the **randomForest** package for R - I just knocked up some simple code so that you can follow what is going on once the predictions from each tree are determined.

OOB predictions can be provided for all samples in the training data because each tree is built from a bootstrap sample and a random forest contains a large number of trees, so each training set observation ends up in the OOB sample for one or more trees.
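A quick back-of-the-envelope check of why this works: with `n` observations, a given sample is left out of one bootstrap sample with probability `(1 - 1/n)^n`, which is about 0.37 for moderately large `n`. The values of `n` and `B` below are arbitrary choices for illustration:

```r
n <- 1000                  # hypothetical training set size
p_oob <- (1 - 1 / n)^n     # P(a given sample is OOB for one tree)
p_oob
exp(-1)                    # the limit as n grows

# Probability that a given sample is never OOB in a forest of B trees,
# treating the bootstrap samples as independent draws
B <- c(1, 10, 50, 500)
p_never_oob <- (1 - p_oob)^B
signif(p_never_oob, 3)
```

The probability of never being OOB shrinks geometrically with the number of trees, so with a few hundred trees essentially every training sample gets an OOB prediction.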

I have glossed over issues such as missing data for some OOB cases etc., but these issues also pertain to a single regression or classification tree. Also note that each split in each tree of a forest considers only `mtry` randomly selected variables as candidates.

## Best Answer

Random forest has several hyperparameters that need to be tuned. To do this correctly, you need to implement a nested cross validation structure. The inner CV will measure out-of-sample performance over a sequence of hyperparameters. The outer CV will characterize performance of the procedure used to select hyperparameters, and can be used to get unbiased estimates of AUC and so forth.
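A minimal sketch of such a nested structure is below. It assumes the **randomForest** package is installed, uses simulated two-class data, and scores folds by misclassification rate rather than AUC to keep the code short. The helpers `folds()` and `fold_error()`, the grid values, and the fold counts are all illustrative choices, not a recommended recipe:

```r
library(randomForest)   # assumes the randomForest package is installed

set.seed(1)
# Hypothetical two-class data, only to make the sketch runnable
n <- 200; p <- 16
x <- matrix(rnorm(n * p), ncol = p,
            dimnames = list(NULL, paste0("x", seq_len(p))))
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n) > 0, "pos", "neg"))

mtry_grid <- c(2, 4, 8, 12)   # illustrative candidate values
folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# Misclassification rate of a forest fitted on `train`, scored on `test`
fold_error <- function(train, test, mtry) {
  fit <- randomForest(x[train, , drop = FALSE], y[train],
                      mtry = mtry, ntree = 200)
  pred <- predict(fit, x[test, , drop = FALSE])
  mean(as.character(pred) != as.character(y[test]))
}

outer_id  <- folds(n, 5)
outer_err <- numeric(5)
chosen    <- numeric(5)
for (k in 1:5) {
  outer_train <- which(outer_id != k)
  outer_test  <- which(outer_id == k)

  # Inner CV: uses only the outer training rows to choose mtry
  inner_id  <- folds(length(outer_train), 3)
  inner_err <- sapply(mtry_grid, function(m) {
    mean(sapply(1:3, function(j) {
      fold_error(outer_train[inner_id != j], outer_train[inner_id == j], m)
    }))
  })
  chosen[k] <- mtry_grid[which.min(inner_err)]

  # Outer CV: score the whole "tune, then refit" procedure on held-out rows
  outer_err[k] <- fold_error(outer_train, outer_test, chosen[k])
}

chosen            # mtry selected within each outer fold
mean(outer_err)   # performance estimate for the tuning procedure
```

The point of the design is that the outer test rows are never seen while `mtry` is being chosen, so `mean(outer_err)` describes the selection procedure as a whole rather than one particular hyperparameter value.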

The hyperparameters that you may tune include `ntree`, `mtry` and tree depth (either `maxnodes` or `nodesize` or both). By far, the most important is `mtry`. The default `mtry` for $p$ features is $\sqrt{p}$ for classification. Increasing `mtry` may improve performance. I recommend trying a grid over the range $\sqrt{p}/2$ to $3\sqrt{p}$ by increments of $\sqrt{p}/2$.

Tuning `ntree` is basically an exercise in selecting a large enough number of trees so that the error rate stabilizes. Because each tree is i.i.d., you can just train a large number of trees and pick the smallest $n$ such that the OOB error rate is basically flat.

By default, `randomForest` will build classification trees with a minimum node size of 1. This can be computationally expensive for many observations. Tuning node size/tree depth might be useful for you, if only to reduce training time. In *Elements of Statistical Learning*, the authors write that they have only observed modest gains in performance to be had by tuning trees in this way.
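The `mtry` grid and the "flat OOB error" check can be sketched as follows. This assumes the **randomForest** package is installed and uses simulated data. The helper `mtry_grid_for()` and the tolerance `tol` are names and values I made up for illustration, and "within `tol` of the final error" is just one possible way of deciding that the trace is flat:

```r
library(randomForest)   # assumes the randomForest package is installed

# Candidate mtry values from sqrt(p)/2 to 3*sqrt(p) in steps of sqrt(p)/2,
# rounded to whole numbers and kept within 1..p
mtry_grid_for <- function(p) {
  step <- sqrt(p) / 2
  unique(pmin(p, pmax(1, round(seq(step, 6 * step, by = step)))))
}
mtry_grid_for(100)   # 5 10 15 20 25 30

set.seed(1)
# Hypothetical two-class data, only to make the sketch runnable
n <- 300; p <- 16
x <- matrix(rnorm(n * p), ncol = p,
            dimnames = list(NULL, paste0("x", seq_len(p))))
y <- factor(ifelse(x[, 1] - x[, 2] + rnorm(n) > 0, "pos", "neg"))

fit <- randomForest(x, y, ntree = 1000)
oob_err <- fit$err.rate[, "OOB"]   # OOB error after 1, 2, ..., ntree trees

# Smallest number of trees after which the OOB error stays within `tol`
# of its final value (tol is an arbitrary illustrative threshold)
tol <- 0.01
outside  <- which(abs(oob_err - oob_err[length(oob_err)]) > tol)
n_stable <- if (length(outside)) max(outside) + 1 else 1
n_stable

plot(oob_err, type = "l", xlab = "Number of trees", ylab = "OOB error rate")
```

In practice I would look at the plot as well as the number, since a single threshold can be fooled by a noisy trace.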