Solved – What makes a Random Forest random besides bootstrapping and random sampling of features

machine learning, python, random forest, scikit-learn, supervised learning

After reading about random forests in the original paper and in textbooks, I was under the impression that what makes the model random is bootstrapping – training each tree on a different random subset of observations drawn with replacement – and random subsampling of features (sometimes called "feature bootstrapping") – considering only a limited number of randomly selected features at each split.
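
To make the two mechanisms concrete, here is a minimal sketch (mine, not part of the original question) of how they could be simulated with NumPy; the sizes are arbitrary:

import numpy as np
rng = np.random.RandomState(0)

n_samples, n_features, max_features = 100, 10, 3

# Bootstrapping: each tree is trained on rows drawn with replacement
bootstrap_rows = rng.choice(n_samples, size=n_samples, replace=True)

# Feature subsampling: each split considers only a random subset of columns
candidate_features = rng.choice(n_features, size=max_features, replace=False)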

However, playing around with the random forest in Scikit-Learn has made me question this assumption. In Scikit-Learn you can disable bootstrapping and turn off random feature subsampling. By the above logic, this should make every tree in the forest identical, and two otherwise-identical random forests with both options disabled should produce the same predictions.

However, creating multiple models without bootstrapping of observations or subsampling of features results in forests whose trees differ and whose predictions are not equal. What else makes the random forest random besides sampling of observations and subsampling of features?

Here is the code I used to test whether two models with bootstrap=False and max_features=1.0 (use all features) make the same predictions in Scikit-Learn.

# Use Boston housing regression dataset
from sklearn.datasets import load_boston
boston = load_boston()

import pandas as pd
X = pd.DataFrame(data=boston.data, columns=boston.feature_names)
y = pd.Series(data=boston.target)

# Split into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
 random_state=100, test_size=100)

from sklearn.ensemble import RandomForestRegressor

# Make two random forests with no bootstrapping and using all features
model1 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=None)
model2 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=None)

# Make predictions with both models
pred_1 = model1.fit(X_train, y_train).predict(X_test)
pred_2 = model2.fit(X_train, y_train).predict(X_test)


# Test predictions for equality
import numpy as np
np.allclose(pred_1, pred_2)

# Output 
False


# Look at predictions which disagree
not_close = np.where(~np.isclose(pred_1, pred_2))
pred_1[not_close]
pred_2[not_close]

# Output

array([29.43, 24.34, 18.39, 19.37, 23.64, 28.22, 21.71, 20.08, 12.54,
       24.71, 26.05, 22.19, 28.29, 22.39, 20.12, 35.41, 47.78, 31.07,
       15.  , 12.11, 13.52,  5.81, 13.96, 25.82, 16.27, 11.42, 16.4 ,
       16.2 , 20.08, 43.53, 24.74, 34.4 , 43.37,  7.84, 13.43, 20.17,
       18.82, 22.97, 16.32, 23.03, 24.26, 28.91, 17.64, 12.64, 11.56,
       16.4 , 20.34, 21.61, 25.3 , 14.37, 34.12, 33.76,  7.94, 20.35,
       14.63, 35.05, 24.39, 16.16, 31.44, 20.28, 10.9 ,  7.34, 32.72,
       10.91, 11.21, 21.96, 41.65, 14.77, 12.84, 16.27, 14.72, 22.34,
       14.44, 17.53, 31.16, 22.66, 23.84, 24.7 , 16.16, 13.91, 30.33,
       48.12, 12.61, 45.58])

array([29.66, 24.5 , 18.34, 19.39, 23.56, 28.34, 21.78, 20.03, 12.91,
       24.73, 25.62, 21.49, 28.36, 22.32, 20.14, 35.14, 48.12, 31.11,
       15.56, 11.84, 13.44,  5.77, 13.9 , 25.81, 16.12, 10.81, 17.15,
       16.18, 20.1 , 41.78, 25.8 , 34.5 , 45.58,  7.65, 12.64, 20.04,
       18.78, 22.43, 15.92, 22.87, 24.28, 29.2 , 17.58, 12.03, 11.49,
       17.15, 20.25, 21.58, 26.05, 12.97, 33.98, 33.94,  8.26, 20.09,
       14.41, 35.19, 24.42, 16.18, 31.2 , 20.5 , 13.61,  7.36, 32.18,
       10.39, 11.07, 21.9 , 41.98, 15.12, 13.12, 16.12, 15.32, 20.84,
       14.49, 17.51, 31.39, 23.46, 23.75, 24.71, 16.42, 13.19, 29.4 ,
       48.46, 12.91, 38.95])

(Thanks to @Sycorax for suggesting using np.allclose() to compare predictions.)

If the random_state of both models is fixed to the same value, the predictions come out exactly the same. This means that some aspect of model fitting is still stochastic.
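
As a quick check of this claim (reusing the training split from above; the seed 42 is arbitrary), two forests sharing a random_state produce identical predictions:

model_a = RandomForestRegressor(bootstrap=False, max_features=1.0, random_state=42)
model_b = RandomForestRegressor(bootstrap=False, max_features=1.0, random_state=42)

np.allclose(model_a.fit(X_train, y_train).predict(X_test),
            model_b.fit(X_train, y_train).predict(X_test))
# True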

I would also expect all the trees to be identical, since there is no difference in the examples they are trained on or in the features they consider when making splits. However, limiting the tree depth to 3 (max_depth=3, versus no depth limit in the previous models) and visualizing two trees from the same forest shows differences between them:

[Figure: tree1 – regression tree from the forest, max_depth=3]

[Figure: tree2 – second regression tree from the same forest, max_depth=3]

These two trees (from the same forest) disagree at node #9, which results in different predictions for the same test point. (A sketch of the visualization code is included below.)
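
For reference, this is roughly the kind of code that produces such plots (the output filename and the graphviz conversion step are illustrative, not the exact commands from my run):

from sklearn.tree import export_graphviz

# Fit a shallow forest so the individual trees are small enough to plot
small_model = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3)
small_model.fit(X_train, y_train)

# Export the first tree of the forest to a .dot file
export_graphviz(small_model.estimators_[0], out_file='tree1.dot',
                feature_names=X_train.columns, rounded=True, precision=2)
# Then convert on the command line: dot -Tpng tree1.dot -o tree1.png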

My question is: what, besides random sampling of observations (bootstrapping) and random subsampling of the features considered at each split, makes the random forest random? If these two sources of randomness are disabled, why are the trees not all exactly the same? Is this specific to the Scikit-Learn implementation?

Best Answer

If we set aside the discrepancies arising from roundoff error, the remaining differences originate in the treatment of ties. The class sklearn.ensemble.RandomForestRegressor is composed of many instances of sklearn.tree.DecisionTreeRegressor (you can verify this by reading the source; the classifier variants behave the same way). If we read the documentation for sklearn.tree.DecisionTreeRegressor, we find that there is some non-determinism in how the trees are built, even when using all features. This is because of how the fit method handles ties:

The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
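
A small way to see this tie-breaking in action (a toy example of mine, not from the question): duplicate a feature so that every candidate split has an equally good twin, and two trees grown with different seeds will typically disagree on which copy they split on.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x = rng.rand(200, 1)
X_tied = np.hstack([x, x])        # two identical columns -> every split is tied
y_toy = np.sin(4 * x).ravel()

tree_a = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_tied, y_toy)
tree_b = DecisionTreeRegressor(max_depth=3, random_state=2).fit(X_tied, y_toy)

# The per-node feature indices usually differ between the two trees,
# even though the trees are equivalent and the training data are identical.
print(tree_a.tree_.feature)
print(tree_b.tree_.feature)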


In most cases, the differences between the two prediction vectors are just roundoff error. Whenever you compare floats for equality, use something like np.isclose rather than ==. Exact == comparison of floats is the way of madness.
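
For instance (a trivial illustration, not from the original data), even an "obviously equal" sum can fail an exact comparison:

0.1 + 0.2 == 0.3             # False: binary roundoff makes the sum 0.30000000000000004
np.isclose(0.1 + 0.2, 0.3)   # True: equal within a small tolerance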

import numpy as np
np.isclose(pred_1, pred_2)
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

For some reason, only the entry at index 34 is mismatched in a way that is not accounted for by numerical error.

mistake = np.where(np.logical_not(np.isclose(pred_1, pred_2)))
mistake
# (array([34]),)
pred_1[mistake]
# array([33.54285714])
pred_2[mistake]
# array([31.82857143])

If I fix the seeds used for the models (below, with two different but explicit random_state values), this discrepancy disappears. It may reappear if you choose a different pair of seeds; I don't know.

model3 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=13)
model4 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=14)

pred_3 = model3.fit(X_train, y_train).predict(X_test)
pred_4 = model4.fit(X_train, y_train).predict(X_test)
np.isclose(pred_3, pred_4).all()
# True

See also: How does a Decision Tree model choose thresholds in scikit-learn?