The general idea is that both Bagging and Random Forests are variance-reduction methods. This means that they work well with estimators that have LOW BIAS and HIGH VARIANCE (estimators that overfit, to put it simply). Moreover, averaging the estimators works best if they are UNCORRELATED with each other.
Decision trees are perfect for this job because, in particular when fully grown, they can learn very complex interactions (therefore having low bias), but are very sensitive to the input data (high variance).
Both sampling strategies have the goal of reducing the correlation between the trees, which reduces the variance of the averaged ensemble (I suggest Elements of Statistical Learning, Chap. 15 for clarifications).
However, while sampling features at every node still lets each tree see most variables (in different orders) and learn complex interactions, using a feature subsample for every tree greatly limits the amount of information a single tree can learn. Trees grown in this fashion tend to be shallower and to have much higher bias, in particular on complex datasets. It is true that trees built this way will be less correlated with each other, as they are often built on completely different subsets of features, but in most scenarios this does not outweigh the increase in bias, so the performance is usually worse in most use cases.
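To make this concrete, here is a rough sketch of my own (not from the original discussion) comparing the two strategies in scikit-learn: BaggingRegressor with max_features < 1.0 subsamples features once per tree, while RandomForestRegressor resamples them at every split. The dataset and parameter values are arbitrary, so treat the numbers as illustrative only.

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)

# Per-tree feature subsampling: each tree (the default base estimator is a
# decision tree) sees only a random third of the features for its whole life.
per_tree = BaggingRegressor(n_estimators=200, max_features=0.33,
                            bootstrap=False, random_state=0)

# Per-split feature subsampling: every split draws a fresh random third.
per_split = RandomForestRegressor(n_estimators=200, max_features=0.33,
                                  bootstrap=False, random_state=0)

for name, model in [("per-tree", per_tree), ("per-split", per_split)]:
    print(name, cross_val_score(model, X, y, cv=5, scoring="r2").mean())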
If we set aside the discrepancies arising from roundoff error, the remaining differences originate in the treatment of ties. sklearn.ensemble.RandomForestClassifier is composed of many instances of sklearn.tree.DecisionTreeClassifier (you can verify this by reading the source). If we read the documentation for sklearn.tree.DecisionTreeClassifier, we find that there is some non-determinism in how the trees are built, even when using all features. This is because of how the fit method handles ties.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
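As a small demonstration of my own (not part of the original question), duplicating a feature makes every split on it an exact tie, so the chosen split feature can change from fit to fit unless random_state is fixed:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])                      # feature 1 is an exact copy of feature 0
y = x.ravel() + rng.normal(scale=0.1, size=200)

# The root split is equally good on either column; without random_state the
# random feature permutation decides which one wins, so this set is typically {0, 1}.
roots = {DecisionTreeRegressor(max_depth=1).fit(X, y).tree_.feature[0] for _ in range(20)}
print(roots)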
In most cases, this is roundoff error. Whenever comparing equality of floats, you want to use something like np.isclose, and not ==. Using == is the way of madness.
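A quick illustrative aside (my own example, not part of the comparison below):

import numpy as np
print(0.1 + 0.2 == 0.3)             # False: binary floating point
print(np.isclose(0.1 + 0.2, 0.3))   # True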
import numpy as np
np.isclose(pred_1, pred_2)
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, False, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True])
For some reason, only the entry at index 34 is mismatched in a way that is not accounted for by numerical error.
mistake = np.where(np.logical_not(np.isclose(pred_1, pred_2)))[0]
mistake
# array([34])
pred_1[mistake]
# array([33.54285714])
pred_2[mistake]
# array([31.82857143])
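If you want to dig further, a sketch along these lines (the names model_1 and model_2 for the two fitted forests are my assumption; they are not defined above) would show how many individual trees disagree on that one test point:

# assumes X_test is a NumPy array; use X_test.iloc[mistake] for a DataFrame
tree_preds_1 = np.stack([t.predict(X_test[mistake]) for t in model_1.estimators_])
tree_preds_2 = np.stack([t.predict(X_test[mistake]) for t in model_2.estimators_])
print((tree_preds_1 != tree_preds_2).sum(), "of", len(tree_preds_1), "trees disagree")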
If I fix the seed used for the models, this discrepancy disappears. It may re-appear if you choose a different pair of seeds. I don't know.
from sklearn.ensemble import RandomForestRegressor

model3 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=13)
model4 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=14)
pred_3 = model3.fit(X_train, y_train).predict(X_test)
pred_4 = model4.fit(X_train, y_train).predict(X_test)
np.isclose(pred_3, pred_4).all()
# True
See also: How does a Decision Tree model choose thresholds in scikit-learn?
At each split, you draw a new random sample of $m$ features.
From Hastie et al., Elements of Statistical Learning (Algorithm 15.1): "Select m variables at random from the p variables."
Here's an example. You have a model with $p=4$ features, numbered 0, 1, 2, 3. You've set $m=2$, and at the first split you randomly draw 1, 3. Then at the second split, you randomly draw 0, 1. Then at the third split you randomly draw 1, 3 again. In this example, choosing feature 1 at all three splits is purely a random coincidence; likewise, drawing the same pair 1, 3 twice is also a coincidence.
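A tiny simulation of my own (not sklearn internals, just the sampling idea): at every split a fresh subset of $m$ candidates is drawn, so repeats across splits are pure chance.

import random

p, m = 4, 2
random.seed(1)
for split in range(1, 4):
    candidates = sorted(random.sample(range(p), m))
    print(f"split {split}: candidate features {candidates}")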