Machine Learning – Understanding Out of Bag Error in Random Forest and Data Partitioning

cross-validation, machine-learning, random-forest

I have a question concerning OOB error in random forests and data partitioning. As far as I know, the trees in a random forest are not pruned, and we use the OOB error to measure the performance of the forest. Why, then, should we use data partitioning (training – validation) when constructing a random forest? In many cases I have seen, a data partitioning process is used. In this case, how should the validation error be interpreted?

Thanks in advance,

Andreas

Best Answer

Training a model, tuning its hyperparameters, and evaluating its performance are typically done using independent training, validation, and test sets. This three-way split can take the form of holdout or nested cross validation. The independence of these sets is important because, otherwise, estimates of the error would be downwardly biased: we would select poor models and expect them to perform better on future data than they really would. Because random forests already use bootstrapping to fit the individual trees, they readily yield the out-of-bag (OOB) error. This is an unbiased estimate of the error on future data. As such, it can take the place of the validation or test error, and is cheaper to compute than nested cross validation.
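For instance, in scikit-learn the OOB estimate comes essentially for free once the forest is fitted. Here is a minimal sketch; the dataset and parameter values are just placeholders:

```python
# Setting oob_score=True makes the forest score each training sample using only
# the trees that did not see it in their bootstrap sample, so we get an error
# estimate without holding out a separate set.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)

# oob_score_ is the OOB accuracy; its complement is the OOB error estimate.
oob_error = 1.0 - forest.oob_score_
print(f"OOB error estimate: {oob_error:.3f}")
```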

If we had a fixed set of hyperparameters, we could train a random forest on the entire dataset, estimate performance using the OOB error, and call it a day. But random forests have hyperparameters that may need to be tuned to balance between under- and overfitting. One of these is the number of features considered for each split. Another is tree size, which is typically controlled by limiting the depth or number of nodes when growing the tree, rather than by pruning after the fact. Rather than splitting the data into training, validation, and test sets, we can use the OOB error in place of the validation or test set error. For example, hyperparameters could be tuned to minimize the OOB error and performance could be evaluated on the test set (possibly using cross validation, with no need for nesting), as sketched below.
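A sketch of that workflow, with the dataset, candidate grid, and parameter values purely illustrative: the number of features per split (max_features) is tuned by minimizing the OOB error on the training data, and the test set is touched only once for the final performance estimate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

best_error, best_m = None, None
for m in [2, 4, 8, 16]:  # candidate values for max_features
    forest = RandomForestClassifier(
        n_estimators=500, max_features=m, oob_score=True, random_state=0
    )
    forest.fit(X_train, y_train)
    oob_error = 1.0 - forest.oob_score_  # OOB error plays the role of the validation error
    if best_error is None or oob_error < best_error:
        best_error, best_m = oob_error, m

# Refit with the selected value and report performance on the held-out test set once.
final = RandomForestClassifier(
    n_estimators=500, max_features=best_m, random_state=0
).fit(X_train, y_train)
print(
    f"best max_features: {best_m}, OOB error: {best_error:.3f}, "
    f"test accuracy: {final.score(X_test, y_test):.3f}"
)
```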