The general idea is that both Bagging and Random Forests are variance-reduction methods. This means that they work well with estimators that have LOW BIAS and HIGH VARIANCE (estimators that overfit, to put it simply). Moreover, averaging the estimators works best if they are UNCORRELATED with each other.
Decision trees are perfect for this job because, in particular when fully grown, they can learn very complex interactions (therefore having low bias), but are very sensitive to the input data (high variance).
Both sampling strategies have the goal of reducing the correlation between the trees, which reduces the variance of the averaged ensemble (I suggest Elements of Statistical Learning, Chap. 15 for clarifications).
However, while sampling features at every node still lets each tree see most variables (in different orders) and learn complex interactions, using a feature subsample for every tree greatly limits the amount of information a single tree can learn. Trees grown in this fashion tend to be shallower and to have much higher bias, in particular on complex datasets. It is true that trees built this way will be less correlated with each other, as they are often built on completely different subsets of features, but in most scenarios this does not outweigh the increase in bias, so the performance is usually worse in most use cases.
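To make this concrete, here is a rough sketch of my own (not from the original discussion) comparing the two strategies in scikit-learn: BaggingRegressor with max_features < 1.0 subsamples features once per tree, while RandomForestRegressor resamples them at every split. The dataset and parameter values are arbitrary, so treat the numbers as illustrative only.

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)

# Per-tree feature subsampling: each tree (the default base estimator is a
# decision tree) sees only a random third of the features for its whole life.
per_tree = BaggingRegressor(n_estimators=200, max_features=0.33,
                            bootstrap=False, random_state=0)

# Per-split feature subsampling: every split draws a fresh random third.
per_split = RandomForestRegressor(n_estimators=200, max_features=0.33,
                                  bootstrap=False, random_state=0)

for name, model in [("per-tree", per_tree), ("per-split", per_split)]:
    print(name, cross_val_score(model, X, y, cv=5, scoring="r2").mean())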
If we set aside the discrepancies arising from roundoff error, the remaining differences originate in the treatment of ties. sklearn.ensemble.RandomForestClassifier is composed of many instances of sklearn.tree.DecisionTreeClassifier (you can verify this by reading the source). If we read the documentation for sklearn.tree.DecisionTreeClassifier, we find that there is some non-determinism in how the trees are built, even when using all features. This is because of how the fit method handles ties.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
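As a small demonstration of my own (not part of the original question), duplicating a feature makes every split on it an exact tie, so the chosen split feature can change from fit to fit unless random_state is fixed:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])                      # feature 1 is an exact copy of feature 0
y = x.ravel() + rng.normal(scale=0.1, size=200)

# The root split is equally good on either column; without random_state the
# random feature permutation decides which one wins, so this set is typically {0, 1}.
roots = {DecisionTreeRegressor(max_depth=1).fit(X, y).tree_.feature[0] for _ in range(20)}
print(roots)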
In most cases, this is roundoff error. Whenever comparing equality of floats, you want to use something like np.isclose, and not ==. Using == is the way of madness.
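A quick illustrative aside (my own example, not part of the comparison below):

import numpy as np
print(0.1 + 0.2 == 0.3)             # False: binary floating point
print(np.isclose(0.1 + 0.2, 0.3))   # True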
import numpy as np
np.isclose(pred_1, pred_2)
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, False, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True])
For some reason, only the entry at index 34 is mismatched in a way that is not accounted for by numerical error.
mistake = np.where(np.logical_not(np.isclose(pred_1, pred_2)))[0]
mistake
# array([34])
pred_1[mistake]
# array([33.54285714])
pred_2[mistake]
# array([31.82857143])
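If you want to dig further, a sketch along these lines (the names model_1 and model_2 for the two fitted forests are my assumption; they are not defined above) would show how many individual trees disagree on that one test point:

# assumes X_test is a NumPy array; use X_test.iloc[mistake] for a DataFrame
tree_preds_1 = np.stack([t.predict(X_test[mistake]) for t in model_1.estimators_])
tree_preds_2 = np.stack([t.predict(X_test[mistake]) for t in model_2.estimators_])
print((tree_preds_1 != tree_preds_2).sum(), "of", len(tree_preds_1), "trees disagree")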
If I fix the seed used for the models, this discrepancy disappears. It may re-appear if you choose a different pair of seeds. I don't know.
from sklearn.ensemble import RandomForestRegressor

model3 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=13)
model4 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=14)
pred_3 = model3.fit(X_train, y_train).predict(X_test)
pred_4 = model4.fit(X_train, y_train).predict(X_test)
np.isclose(pred_3, pred_4).all()
# True
See also: How does a Decision Tree model choose thresholds in scikit-learn?
At each split, you draw a new random sample of $m$ features.
From Hastie et al., Elements of Statistical Learning (Algorithm 15.1): "Select m variables at random from the p variables."
Here's an example. You have a model with $p=4$ features, numbered 0, 1, 2, 3. You've set $m=2$, and at the first split you randomly draw 1, 3. Then at the second split, you randomly draw 0, 1. Then at the third split you randomly draw 1, 3 again. In this example, choosing feature 1 at all three splits is purely a random coincidence; likewise, drawing the same pair 1, 3 twice is also a coincidence.
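A tiny simulation of my own (not sklearn internals, just the sampling idea): at every split a fresh subset of $m$ candidates is drawn, so repeats across splits are pure chance.

import random

p, m = 4, 2
random.seed(1)
for split in range(1, 4):
    candidates = sorted(random.sample(range(p), m))
    print(f"split {split}: candidate features {candidates}")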