Solved – Setting Leaf Node’s Minimum Sample Value for Random Forest / Decision Trees

Tags: cart, machine learning, random forest

I have seen a few useful examples in the scikit-learn documentation where overfitting can be mitigated to a reasonable extent by requiring that each split leave every node with at least a certain number of samples/observations.

I'm assuming other software packages, in R for example, have this feature too. I don't really mind which implementation the answer comes from, because I'm asking more from the theoretical perspective. But for the sake of illustration, let's use Python's syntax for a moment:

    from sklearn.tree import DecisionTreeClassifier
    DecisionTreeClassifier(min_samples_leaf=5)

If we omit the min_samples_leaf argument it defaults to 1, which means the decision tree/random forest will accept a split even when it leaves just a single observation in a leaf. That does seem prone to overfitting.

My question is: how should we decide what min_samples_leaf ought to be in the general case? If we have a very large dataset, wouldn't 1 or even 5 be too small? Specifically, is there a formula for it; would we make it proportional to n (the total number of observations), or use some other heuristic?
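(Worth noting for the "proportional to n" idea: scikit-learn's min_samples_leaf also accepts a float, which it interprets as a fraction of the training set, so the minimum scales with n automatically. A minimal sketch:)

    from sklearn.tree import DecisionTreeClassifier

    # A float min_samples_leaf is interpreted as a fraction of n_samples,
    # i.e. ceil(0.01 * n) observations per leaf, so it scales with the data.
    DecisionTreeClassifier(min_samples_leaf=0.01)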

Best Answer

If you look at other packages, say ranger in R, its manual gives the following:

[Screenshot of ranger's documentation for its minimum node size parameter (min.node.size).]

Other packages have guidelines of this sort.

I like to think of it as: what is the minimum number of samples sufficient for the error to be below some x?

If I have binomial data, I like to look at the 95% confidence interval given the sample size. To get above 50/50, that is, how many binomial samples, all of one state, do you need for a decent chance that the lower bound of your 95% CI is above 50/50? It is a textbook question. Here is a place to explore it: link.
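A minimal sketch of that textbook calculation, assuming the Clopper-Pearson (exact) interval: when all n samples are one state (x = n successes), the lower bound of the two-sided 95% CI reduces to (alpha/2)^(1/n), so we can search for the smallest n whose lower bound clears 50/50:

    # Smallest n, all of one state, whose exact (Clopper-Pearson) two-sided
    # 95% CI has a lower bound above 0.5.  With x = n successes the lower
    # bound reduces to (alpha/2) ** (1/n).
    alpha = 0.05
    n = 1
    while (alpha / 2) ** (1 / n) <= 0.5:
        n += 1
    print(n, round((alpha / 2) ** (1 / n), 3))  # 6 0.541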

Error in estimating the mean is a different story for continuous variables. There are various "rules of thumb" thrown around, but you have to know the distribution of values, and whether it is normal enough, before chanting voodoo. Some say the magic number is 35 samples; it is voodoo.
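To see why no single number can be magic, consider the usual sample-size formula for estimating a mean: the 95% CI half-width is about 1.96 * sigma / sqrt(n), so the n you need depends entirely on the spread of the data and the error you will tolerate. A sketch, assuming roughly normal data:

    # n needed so the 95% CI for a mean has half-width <= margin,
    # assuming roughly normal data with standard deviation sigma.
    def n_for_margin(sigma, margin, z=1.96):
        return (z * sigma / margin) ** 2

    print(round(n_for_margin(sigma=1.0, margin=0.33)))  # ~35, only for this ratio
    print(round(n_for_margin(sigma=1.0, margin=0.10)))  # ~384 for a tighter margin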

Remember this: There are no "silver bullets".

In science there is no magic. In magic there is no science. Any magic sufficiently analyzed and characterized is indistinguishable from science, just as any science that is sufficiently "advanced" (whatever that means) can be indistinguishable from magic.

If there were infinite data and infinitely fast computers, it might be easy to get great fits. In the real world we are constrained in both.

If I want 1% error per leaf, and I have binomial data, plenty of CPU, and a decent volume of data, I might require 300 samples per leaf. If I want 1-in-10k error, then I want roughly 40k samples per leaf.
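Numbers of this magnitude fall out of the "rule of three": if n samples show zero failures, the upper 95% bound on the failure rate is roughly 3/n, so n is roughly 3 / target_error (using -ln(0.025) ~ 3.7 instead of 3 for the two-sided version is one way to land near a 40k figure). A sketch:

    # Rule of three: zero failures in n trials puts the upper 95% bound on
    # the failure rate near 3 / n, so n ~ 3 / target_error.
    for target_error in (0.01, 1e-4):
        print(target_error, round(3 / target_error))  # ~300 and ~30000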

The problem I am solving, with proper application of statistics, gives the number of samples that I must use.