If we set aside the discrepancies arising from roundoff error, the remaining differences originate in the treatment of ties. The class sklearn.ensemble.RandomForestClassifier is composed of many instances of sklearn.tree.DecisionTreeClassifier (you can verify this by reading the source; the same relationship holds between RandomForestRegressor and sklearn.tree.DecisionTreeRegressor, used below). The documentation for sklearn.tree.DecisionTreeClassifier tells us that there is some non-determinism in how the trees are built, even when using all features, because of how the fit method handles ties:
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
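To see the tie-handling directly, here is a minimal sketch (my own construction, not from the original question): duplicating a feature column guarantees that the best split on one column ties with the same split on the other, so without a fixed random_state two fits may pick different, equally good, features.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])  # two identical columns -> every candidate split is tied
y = (x[:, 0] > 0).astype(int)

tree_a = DecisionTreeClassifier(max_depth=1).fit(X, y)  # random_state not fixed
tree_b = DecisionTreeClassifier(max_depth=1).fit(X, y)
tree_a.tree_.feature[0], tree_b.tree_.feature[0]  # root feature may be 0 for one fit and 1 for the other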
In most cases, this is roundoff error. Whenever you compare floats for equality, you want to use something like np.isclose, not ==. Using == is the way of madness.
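A quick illustration of why, using nothing from the question itself:

import numpy as np
a = 0.1 + 0.2
a == 0.3            # False: 0.1, 0.2 and 0.3 have no exact binary representation
np.isclose(a, 0.3)  # True: equal within a small relative/absolute tolerance

Applied to the two prediction vectors: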
import numpy as np
np.isclose(pred_1, pred_2)
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, False, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True])
For some reason, only the entry at index 34 is mismatched in a way that is not accounted for by numerical error.
mistake = np.where(np.logical_not(np.isclose(pred_1, pred_2)))[0]
mistake
# array([34])
pred_1[mistake]
# array([33.54285714])
pred_2[mistake]
# array([31.82857143])
If I fix the seeds used for the models, this discrepancy disappears. It may reappear if you choose a different pair of seeds; I haven't checked.
from sklearn.ensemble import RandomForestRegressor

model3 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=13)
model4 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=14)
pred_3 = model3.fit(X_train, y_train).predict(X_test)
pred_4 = model4.fit(X_train, y_train).predict(X_test)
np.isclose(pred_3, pred_4).all()
# True
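As a sanity check (reusing the same X_train/X_test split as above), refitting the same seeded configuration reproduces the predictions bit for bit, because an identical seed replays an identical sequence of floating-point operations; this is the one situation where even == is safe:

model3_again = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=13)
pred_3_again = model3_again.fit(X_train, y_train).predict(X_test)
(pred_3 == pred_3_again).all()
# True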
See also: How does a Decision Tree model choose thresholds in scikit-learn?
Best Answer
Pick a large number of trees, say 100. From what I have read on the Internet, the usual heuristic is to pick $\sqrt{M}$ randomly selected features at each split, where $M$ is the total number of features; with $M = 250$ that would be $\sqrt{250}$. However, in the original paper, Breiman used roughly the closest integer to $\log_2 M = \frac{\log M}{\log 2}$.
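Plugging in the numbers (taking $M = 250$, as the $\sqrt{250}$ above suggests):

import numpy as np
M = 250
np.sqrt(M)   # about 15.8 -> 16 features by the square-root rule
np.log2(M)   # about 7.97 -> 8 features by Breiman's rule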
I would say cross-validation is usually the key to finding good parameter values, but I do not know enough about random forests to be more specific; a sketch of how one might set that up follows.
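A minimal cross-validation sketch over these heuristics (the synthetic data and grid below are my own illustration, not from the question):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=20, random_state=0)
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid={"max_features": ["sqrt", "log2", 1.0]},  # the heuristics discussed above
    cv=5,
)
search.fit(X, y)
search.best_params_  # whichever max_features setting cross-validates best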