I was reading a book on ML, and it says:

> Actually, since the training algorithm used by Scikit-Learn is stochastic you may get very different models even on the same training data (unless you set the `random_state` hyperparameter)
I wonder if such randomness is due to the way thresholds are chosen in scikit-learn. The documentation explains how scikit-learn splits a node based on a single feature and a weighted impurity measure. For now, let's consider every feature before making a split (i.e., set `max_features=None`) so that no randomness comes from the choice of features.
My understanding is that if we use the same training set, and if we select a finite number of thresholds by a non-random rule, for example the midpoints $(x_{(i)}^j + x_{(i+1)}^j) / 2$, where $x_{(i)}^j$ is the $i$-th smallest value of the $j$-th component over the training vectors $\mathbf{x}$, then it's very likely there is only one global optimum $(j, t_m)$ for the best split. Randomness only kicks in when there is more than one minimum we can use for splitting.
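To make this concrete, here is a minimal sketch of that exhaustive midpoint search. The helper names `gini` and `best_split` are mine for illustration; this is not scikit-learn's actual (Cython) implementation:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search over (feature j, midpoint threshold t),
    minimizing the weighted impurity of the two children."""
    n, d = X.shape
    best = (None, None, np.inf)  # (j, t, weighted impurity)
    for j in range(d):
        values = np.unique(X[:, j])                   # sorted unique x_(i)^j
        thresholds = (values[:-1] + values[1:]) / 2   # midpoints
        for t in thresholds:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

X = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 7.0], [4.0, 8.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # (0, 2.5, 0.0)
```

Note that in this toy data both features split the classes perfectly; the strict `<` comparison means the deterministic loop always keeps the first optimum it encounters, whereas a tie-break like the one discussed below would depend on the order in which features are visited.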
Also, besides being used to select the features considered when looking for the best split (when `max_features != None`), where else is `random_state` used?
Best Answer
In random forests, stochasticity is mainly caused by the following two factors:

- Bootstrap sampling of the training set, controlled by the `bootstrap` parameter.
- Random subsets of features considered at each split, controlled by `max_features`.

In a single Decision Tree, we have neither of them. There may also be other sources of randomness due to implementation differences. For example, `DecisionTree` in `sklearn` has a `splitter` argument, which is `"best"` by default but can be set to `"random"`. `RandomForest` uses this class with its default attributes, i.e. `splitter="best"`, where the chosen threshold is the arithmetic average of the two boundary values. There seems to be no random rule here.

For a Decision Tree (and therefore a Random Forest), the following explanation is given in the documentation (which you've also found on the RF's page):
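As a quick sanity check of the arithmetic-average rule (assuming scikit-learn is installed; the data here is made up), a boundary between the values 2 and 3 should produce a threshold of 2.5:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One feature, with a clean class boundary between x=2 and x=3.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# The root threshold is the arithmetic average of the boundary values 2 and 3.
print(tree.tree_.threshold[0])  # 2.5
```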
> The features are always randomly permuted at each split, even if `splitter` is set to `"best"`. When `max_features < n_features`, the algorithm will select `max_features` at random at each split before finding the best split among them. But the best found split may vary across different runs, even if `max_features=n_features`. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, `random_state` has to be fixed to an integer.

This means that with `max_features=None` and `splitter='best'` the search is less likely to be random, yes; but since the features are randomly shuffled at each node, if two of them give the same gain, we end up with different trees on different runs. Note that this property is implementation-specific.
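A small experiment that illustrates this tie-breaking (the duplicated-column setup is my own construction, and the exact behaviour may vary across scikit-learn versions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])                  # two identical columns: every split ties
y = (x[:, 0] > 0).astype(int)

# The feature chosen at the root depends on the random permutation of features,
# so across seeds either column can show up, even with max_features=None and
# splitter='best'.
roots = {DecisionTreeClassifier(random_state=s).fit(X, y).tree_.feature[0]
         for s in range(20)}
print(roots)

# Fixing random_state makes fitting deterministic and reproducible.
a = DecisionTreeClassifier(random_state=42).fit(X, y)
b = DecisionTreeClassifier(random_state=42).fit(X, y)
print(a.tree_.feature[0] == b.tree_.feature[0])  # True
```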