If we set aside the discrepancies arising from roundoff error, the remaining differences originate in the treatment of ties. The class sklearn.ensemble.RandomForestClassifier is composed of many instances of sklearn.tree.DecisionTreeClassifier (you can verify this by reading the source; the same relationship holds between RandomForestRegressor and sklearn.tree.DecisionTreeRegressor, used below). The documentation for sklearn.tree.DecisionTreeClassifier tells us that there is some non-determinism in how the trees are built, even when using all features, because of how the fit method handles ties:
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
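To see the tie-handling directly, here is a minimal sketch (my own construction, not from the original question): duplicating a feature column guarantees that the best split on one column ties with the same split on the other, so without a fixed random_state two fits may pick different, equally good, features.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])  # two identical columns -> every candidate split is tied
y = (x[:, 0] > 0).astype(int)

tree_a = DecisionTreeClassifier(max_depth=1).fit(X, y)  # random_state not fixed
tree_b = DecisionTreeClassifier(max_depth=1).fit(X, y)
tree_a.tree_.feature[0], tree_b.tree_.feature[0]  # root feature may be 0 for one fit and 1 for the other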
In most cases, this is roundoff error. Whenever you compare floats for equality, you want to use something like np.isclose, not ==. Using == is the way of madness.
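A quick illustration of why, using nothing from the question itself:

import numpy as np
a = 0.1 + 0.2
a == 0.3            # False: 0.1, 0.2 and 0.3 have no exact binary representation
np.isclose(a, 0.3)  # True: equal within a small relative/absolute tolerance

Applied to the two prediction vectors: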
import numpy as np
np.isclose(pred_1, pred_2)
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, False, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True])
For some reason, only the entry at index 34 is mismatched in a way that is not accounted for by numerical error.
mistake = np.where(np.logical_not(np.isclose(pred_1, pred_2)))[0]
mistake
# array([34])
pred_1[mistake]
# array([33.54285714])
pred_2[mistake]
# array([31.82857143])
If I fix the seeds used for the models, this discrepancy disappears. It may reappear if you choose a different pair of seeds; I haven't checked.
from sklearn.ensemble import RandomForestRegressor

model3 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=13)
model4 = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=14)
pred_3 = model3.fit(X_train, y_train).predict(X_test)
pred_4 = model4.fit(X_train, y_train).predict(X_test)
np.isclose(pred_3, pred_4).all()
# True
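As a sanity check (reusing the same X_train/X_test split as above), refitting the same seeded configuration reproduces the predictions bit for bit, because an identical seed replays an identical sequence of floating-point operations; this is the one situation where even == is safe:

model3_again = RandomForestRegressor(bootstrap=False, max_features=1.0, max_depth=3, random_state=13)
pred_3_again = model3_again.fit(X_train, y_train).predict(X_test)
(pred_3 == pred_3_again).all()
# True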
See also: How does a Decision Tree model choose thresholds in scikit-learn?
Best Answer
Pick a large number of trees, say 100. From what I have read on the Internet, the usual heuristic is to pick $\sqrt{M}$ randomly selected features at each split, where $M$ is the total number of features; with $M = 250$ that would be $\sqrt{250}$. However, in the original paper, Breiman used roughly the closest integer to $\log_2 M = \frac{\log M}{\log 2}$.
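Plugging in the numbers (taking $M = 250$, as the $\sqrt{250}$ above suggests):

import numpy as np
M = 250
np.sqrt(M)   # about 15.8 -> 16 features by the square-root rule
np.log2(M)   # about 7.97 -> 8 features by Breiman's rule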
I would say cross-validation is usually the key to finding good parameter values, but I do not know enough about random forests to be more specific; a sketch of how one might set that up follows.
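A minimal cross-validation sketch over these heuristics (the synthetic data and grid below are my own illustration, not from the question):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=20, random_state=0)
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid={"max_features": ["sqrt", "log2", 1.0]},  # the heuristics discussed above
    cv=5,
)
search.fit(X, y)
search.best_params_  # whichever max_features setting cross-validates best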