Random Forest – Understanding Out-of-Bag Sample Size in Random Forests

machine learningrandom forestsampling

I am reading the description of RF here.

In the "How random forests work" section, it is written that:

When the training set for the current tree is drawn by sampling with replacement, about one-third of the cases are left out of the sample. This oob (out-of-bag) data is used to get a running unbiased estimate of the classification error as trees are added to the forest

I am unable to understand if the one-third of the cases (out-of-bag samples size) is:

  • an arbitrary value defined in the algorithm
  • an estimate (e.g. on average, sampling with replacement will leave out one-third of the cases)

or something else.

Best Answer

It comes from the construction of a bootstrap sample: you're sampling $n$ observations with replacement to a sample size of $n$. The probability that an observation is omitted is $(1-\frac{1}{n})^n.$* Now consider the definition of $\exp(-1)=\lim \limits_{n\to\infty}(1-\frac{1}{n})^n$ and observe that $\exp(-1)=0.3678...\approx\frac{1}{3}.$


*To verify this, I will define the probability space of the bootstrap: $\Omega=\{x_1, x_2, x_3, \dots, x_n\}$ where each $x_{i\in I=\{i\in\mathbb{N}:i\le n\}}$ is an observation, $\mathcal{F}=2^\Omega$. We will denote the boostrap sample as $B$. Note that we can take this $\sigma$-field $\mathcal{F}$ because we must have a finite number of observations. Collecting our bootstrap sample one observation at a time, our event of interest $E$ occurs when some observation $x_i$ is selected for the bootstrap sample, and we must define a probability measure for it. That is, $P(E)=P(\{x_i \in B\})$.

We can think of drawing a bootstrap sample as an experiment where there are $n$ trials. Each trial is selecting one of our observations uniformly at random with replacement, so it will either include $x_i$ with probability $P(E)=\frac{|E|}{|\Omega|}=\frac{1}{n},$ or exclude $x_i$ with probability $P(E^c)=\frac{|\Omega|-|E|}{|\Omega|}=1-\frac{1}{n}.$ Our probability space $(\Omega, \mathcal{F}, P)$ is now completely defined. The experiment we're performing has $n$ trials, so the probability that $x_i$ is omitted from all of them is $(\bigcup_{i=1}^n P(E))^c=\bigcap_{i=1}^n P(E^c)=(1-\frac{1}{n})^n$.