I'm confused about a particular part of the documentation. I want to know what min_samples_leaf refers to when it's input as a float.
min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
My question is this: n_samples is not a valid parameter of this model. If I set min_samples_leaf to 0.1, does that mean that at least 10% of the max_samples (i.e., the total number of samples taken during the bootstrap) have to be in each child, or does it mean that at least 10% of the samples that are currently in the node being considered for splitting must be in each child?
Best Answer
n_samples is an implicit parameter of the model when calling the fit and predict functions, calculated based on the X or y matrices, e.g.
So, when given as a float, min_samples_leaf is a fraction of the total number of samples seen during training, not of the number of samples in the node that is to be split.
This can also be seen in the following code segment of the decision tree classifier:
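To make the arithmetic concrete, here is a minimal sketch of the rule quoted from the documentation. It is not scikit-learn's actual source: resolve_min_samples_leaf and can_split are hypothetical helper names, and the sample counts are made up for illustration.

```python
import math


def resolve_min_samples_leaf(min_samples_leaf, n_samples):
    """Turn min_samples_leaf into an absolute count, per the documented rule."""
    if isinstance(min_samples_leaf, int):
        # Int: used directly as the minimum number of samples per leaf.
        return min_samples_leaf
    # Float: a fraction of the total number of training samples,
    # not of the samples in whichever node is currently being split.
    return math.ceil(min_samples_leaf * n_samples)


def can_split(node_size, min_leaf):
    """A split is only possible if both children can hold min_leaf samples."""
    return node_size >= 2 * min_leaf


n_samples = 1000  # assumption: number of rows in the X passed to fit
min_leaf = resolve_min_samples_leaf(0.1, n_samples)

print(min_leaf)                  # 100
print(can_split(250, min_leaf))  # True
print(can_split(150, min_leaf))  # False
```

Under these assumptions the threshold stays at 100 samples per leaf at every depth, so a node holding 150 samples can no longer be split, even though 10% of that node would be only 15 samples.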