Random Forest – Understanding Randomness in Random Forest Algorithms


One way randomness is introduced in random forests is to consider only a random subset of the features at each split node.

What happens at the next split node further down the same branch, beneath the current split node? Is a random subset drawn from the former subset, i.e., a sub-subset? Because I think drawing a subset from the entire feature set again does not make sense.

Best Answer

At each split, you draw a new random sample of $m$ features.

From Hastie et al., The Elements of Statistical Learning:

Algorithm 15.1 Random Forest for Regression or Classification.

  1. For $b = 1$ to $B$:

a. Draw a bootstrap sample $Z^*$ of size $N$ from the training data.

b. Grow a random-forest tree $T_b$ to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_\min$ is reached.

i. Select $m$ variables at random from the $p$ variables.

ii. Pick the best variable/split-point among the $m$.

iii. Split the node into two daughter nodes.

  2. Output the ensemble of trees $\{T_b\}^B_1$.

To make a prediction at a new point $x$:

  • Regression: $\hat{f}^B_\text{rf}(x) = \frac{1}{B}\sum_{b=1}^B T_b(x).$

  • Classification: Let $\hat{C}_b(x)$ be the class prediction of the $b$th random-forest tree. Then $\hat{C}^B_\text{rf}(x) = \textit{majority vote}\ \{\hat{C}_b(x)\}^B_1$.
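
To make the per-split draw concrete, here is a minimal NumPy-only sketch of Algorithm 15.1 for regression (illustrative only, not the book's or any library's implementation; names such as `grow_tree`, `best_split`, and `Node` are made up). The important line is the fresh draw of $m$ candidate features inside `grow_tree`: it happens at every split, independently of what was drawn at the parent node.

```python
import numpy as np

rng = np.random.default_rng(0)

class Node:
    def __init__(self, prediction=None, feature=None, threshold=None,
                 left=None, right=None):
        self.prediction = prediction   # leaf value (mean of y in the node)
        self.feature = feature         # index of the splitting feature
        self.threshold = threshold     # split point
        self.left, self.right = left, right

def best_split(X, y, features):
    """Step b.ii: pick the best variable/split-point among the m candidates."""
    best, best_sse = None, np.inf
    for j in features:
        for t in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= t
            sse = ((y[mask] - y[mask].mean()) ** 2).sum() + \
                  ((y[~mask] - y[~mask].mean()) ** 2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, t)
    return best

def grow_tree(X, y, m, n_min=5):
    # Step b: recursively split until the minimum node size n_min is reached.
    if len(y) < n_min:
        return Node(prediction=y.mean())
    # Step b.i: select m variables at random from ALL p variables --
    # a fresh draw at every split, not a subset of the parent's draw.
    features = rng.choice(X.shape[1], size=m, replace=False)
    split = best_split(X, y, features)
    if split is None:                       # no usable split among the m features
        return Node(prediction=y.mean())
    j, t = split
    mask = X[:, j] <= t                     # step b.iii: two daughter nodes
    return Node(feature=j, threshold=t,
                left=grow_tree(X[mask], y[mask], m, n_min),
                right=grow_tree(X[~mask], y[~mask], m, n_min))

def predict_tree(node, x):
    while node.prediction is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction

def random_forest_regression(X, y, B=10, m=2, n_min=5):
    trees = []
    for _ in range(B):
        # Step a: bootstrap sample of size N from the training data.
        idx = rng.integers(0, len(y), size=len(y))
        trees.append(grow_tree(X[idx], y[idx], m, n_min))
    return trees

def predict_forest(trees, x):
    # Regression prediction: average the B tree predictions.
    return np.mean([predict_tree(t, x) for t in trees])

# Tiny usage example on synthetic data.
X = rng.normal(size=(100, 4))
y = X[:, 1] + 0.1 * rng.normal(size=100)
forest = random_forest_regression(X, y, B=10, m=2)
print(predict_forest(forest, X[0]), y[0])
```

The exhaustive split search in `best_split` is only there to keep the sketch short and self-contained; real implementations search for split points far more efficiently.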


Here's an example. You have a model with $p=4$ features, numbered 0, 1, 2, 3. You've set $m=2$. At the first split you randomly draw features 1 and 3; at the second split you randomly draw 0 and 1; at the third split you draw 1 and 3 again. In this example, feature 1 appearing in all three draws is purely a random coincidence, and so is drawing the same pair 1, 3 twice.
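
To see that each draw is taken from all $p$ features, independently of the earlier draws, here is a tiny sketch (again just illustrative) of three consecutive draws with $p=4$ and $m=2$; the particular pairs you get depend on the seed.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 4, 2

for split in range(1, 4):
    # Fresh draw of m out of all p features at every split.
    candidates = sorted(rng.choice(p, size=m, replace=False).tolist())
    print(f"split {split}: candidate features {candidates}")

# Repeats across splits (the same feature, or even the same pair)
# are pure coincidence, not the result of drawing from a sub-subset.
```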
