I was reading a book on ML, and it says:

> Actually, since the training algorithm used by Scikit-Learn is stochastic you may get very different models even on the same training data (unless you set the `random_state` hyperparameter)
I wonder if such randomness is due to the way thresholds are chosen in scikit-learn. The documentation explains how scikit-learn splits a node based on a single feature and a weighted impurity measure. For now, let's consider every feature before making a split (i.e., set `max_features=None`) so that no randomness comes from the choice of features.
My understanding is that if we use the same training set, and if we select a finite number of thresholds by a non-random rule, for example the midpoints $(x_{(i)}^j + x_{(i+1)}^j) / 2$, where $x_{(i)}^j$ is the $i$-th smallest value of the $j$-th component over the training vectors $\mathbf{x}$, then it's very likely there is only one global optimum $(j, t_m)$ for the best split. Randomness only kicks in when there is more than one minimum we can use for splitting.
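To make this concrete, here is a minimal sketch of that exhaustive midpoint search. The helper names `gini` and `best_split` are mine for illustration; this is not scikit-learn's actual (Cython) implementation:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search over (feature j, midpoint threshold t),
    minimizing the weighted impurity of the two children."""
    n, d = X.shape
    best = (None, None, np.inf)  # (j, t, weighted impurity)
    for j in range(d):
        values = np.unique(X[:, j])                   # sorted unique x_(i)^j
        thresholds = (values[:-1] + values[1:]) / 2   # midpoints
        for t in thresholds:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

X = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 7.0], [4.0, 8.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # (0, 2.5, 0.0)
```

Note that in this toy data both features split the classes perfectly; the strict `<` comparison means the deterministic loop always keeps the first optimum it encounters, whereas a tie-break like the one discussed below would depend on the order in which features are visited.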
Also, besides being used to select the features considered when looking for the best split (when `max_features != None`), where else is `random_state` used?
Best Answer
In random forests, stochasticity is mainly caused by the following two factors:

- Bootstrap sampling of the training set, controlled by the `bootstrap` parameter.
- Random subsets of features considered at each split, controlled by `max_features`.

In a single Decision Tree, we have neither of them. There may also be other sources of randomness due to implementation differences. For example, `DecisionTree` in `sklearn` has a `splitter` argument, which is `"best"` by default but can be set to `"random"`. `RandomForest` uses this class with its default attributes, i.e. `splitter="best"`, where the chosen threshold is the arithmetic average of the two boundary values. There seems to be no random rule here.

For a Decision Tree (and therefore a Random Forest), the following explanation is given in the documentation (which you've also found on the RF's page):
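As a quick sanity check of the arithmetic-average rule (assuming scikit-learn is installed; the data here is made up), a boundary between the values 2 and 3 should produce a threshold of 2.5:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One feature, with a clean class boundary between x=2 and x=3.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# The root threshold is the arithmetic average of the boundary values 2 and 3.
print(tree.tree_.threshold[0])  # 2.5
```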
> The features are always randomly permuted at each split, even if `splitter` is set to `"best"`. When `max_features < n_features`, the algorithm will select `max_features` at random at each split before finding the best split among them. But the best found split may vary across different runs, even if `max_features=n_features`. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, `random_state` has to be fixed to an integer.

This means that with `max_features=None` and `splitter='best'` the search is less likely to be random, yes; but since the features are randomly shuffled at each node, if two of them give the same gain, we end up with different trees on different runs. Note that this property is implementation-specific.
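A small experiment that illustrates this tie-breaking (the duplicated-column setup is my own construction, and the exact behaviour may vary across scikit-learn versions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])                  # two identical columns: every split ties
y = (x[:, 0] > 0).astype(int)

# The feature chosen at the root depends on the random permutation of features,
# so across seeds either column can show up, even with max_features=None and
# splitter='best'.
roots = {DecisionTreeClassifier(random_state=s).fit(X, y).tree_.feature[0]
         for s in range(20)}
print(roots)

# Fixing random_state makes fitting deterministic and reproducible.
a = DecisionTreeClassifier(random_state=42).fit(X, y)
b = DecisionTreeClassifier(random_state=42).fit(X, y)
print(a.tree_.feature[0] == b.tree_.feature[0])  # True
```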