Random Forest Hyperparameter – Understanding Min_Samples_Leaf in Scikit-Learn’s RandomForestClassifier

hyperparameter, machine learning, random forest

I'm confused about a particular part of the documentation. I want to know what min_samples_leaf refers to when it's input as a float.

min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

My question is this: n_samples is not a valid parameter of this model. If I set min_samples_leaf to 0.1, does that mean that at least 10% of the max_samples (i.e., the total number of samples taken during the bootstrap) have to be in each child, or does it mean that at least 10% of the samples that are currently in that node being considered for splitting must be in each child?

Best Answer

n_samples is not a constructor parameter. It is implicit: it is taken from the shape of the X (or y) array passed to fit and predict, e.g.

X: {array-like, sparse matrix} of shape (n_samples, n_features)

So, when given as a float, min_samples_leaf is a fraction of the total number of samples seen during training, not of the number of samples in the node that is being split.
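As a rough illustration of the rule quoted from the documentation, the conversion from a float to a sample count can be sketched in a few lines. The helper name leaf_threshold is hypothetical and is not part of scikit-learn; it only mirrors the documented ceil(min_samples_leaf * n_samples) formula.

```python
from math import ceil


def leaf_threshold(min_samples_leaf, n_samples):
    """Hypothetical helper mirroring the documented rule (not a scikit-learn function)."""
    if isinstance(min_samples_leaf, int):
        # An int is used directly as the minimum number of samples per leaf.
        return min_samples_leaf
    # A float is a fraction of the total number of training samples.
    return int(ceil(min_samples_leaf * n_samples))


# The threshold is fixed once from the total sample count,
# so it is the same at every depth of the tree.
threshold_from_float = leaf_threshold(0.25, 150)  # ceil(37.5)
threshold_from_int = leaf_threshold(5, 150)
```

Under this reading the threshold does not shrink as nodes get smaller, which is what distinguishes it from a "fraction of the current node" interpretation.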

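This can also be seen in the following excerpt from the decision tree's fit code. The excerpt shows min_samples_split, but min_samples_leaf is converted from a float in the same way, using the same n_samples: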

...
# Determine output settings
n_samples, self.n_features_in_ = X.shape
...
min_samples_split = int(ceil(self.min_samples_split * n_samples))
...
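
One way to check this empirically is to fit a single tree and count how many training samples end up in each leaf. The sketch below assumes scikit-learn and NumPy are installed and uses a synthetic dataset; it uses DecisionTreeClassifier rather than RandomForestClassifier so that bootstrap resampling does not complicate the counts.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 200 training samples (an arbitrary choice for this sketch).
X, y = make_classification(n_samples=200, random_state=0)

clf = DecisionTreeClassifier(min_samples_leaf=0.1, random_state=0).fit(X, y)

# apply() returns the index of the leaf each sample lands in.
leaf_ids = clf.apply(X)
leaf_sizes = np.bincount(leaf_ids)
leaf_sizes = leaf_sizes[leaf_sizes > 0]  # drop indices of internal nodes

smallest_leaf = int(leaf_sizes.min())
# Threshold implied by the documented formula, using the total sample count.
expected_floor = int(np.ceil(0.1 * X.shape[0]))

print(smallest_leaf, expected_floor)
```

If the fraction were taken relative to the node being split, leaves much smaller than 10% of the training set would be possible; with the total-sample interpretation, no leaf should fall below expected_floor.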
