Random Forest – How to Prune Random Forests vs. Stopping Criteria

cart, random forest, scikit-learn

I have recently noticed that scikit-learn now supports cost complexity pruning, which is great. Since this has been implemented, should I still use the other decision tree / random forest hyper-parameters to control tree growth, or just prune fully grown trees?

I see that Classification and Regression Trees by Breiman et al. (1984) recommends pruning over stopping. Is this still the case if I want to use a hyper-parameter grid search in scikit-learn?

Best Answer

As you say, Breiman himself suggests pruning over stopping, and the reason is that stopping can be short-sighted: blocking a "bad" split now might prevent some very "good" splits from happening later. Pruning, on the other hand, starts from the fully grown tree (so it takes longer to run) but does not have this problem.
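For concreteness, here is a minimal sketch of how scikit-learn exposes this idea (the synthetic dataset is purely illustrative): the tree is grown fully first, and cost_complexity_pruning_path then returns the sequence of effective alphas, one per subtree in the pruning sequence.

```python
# Minimal sketch of cost complexity pruning in scikit-learn.
# The synthetic regression data below is just for illustration.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Pruning works backwards from the fully grown tree:
# cost_complexity_pruning_path grows the tree and computes the pruning sequence.
tree = DecisionTreeRegressor(random_state=0)
path = tree.cost_complexity_pruning_path(X, y)

# path.ccp_alphas: one effective alpha per pruned subtree,
# from the full tree (alpha = 0) up to the root-only tree.
print(path.ccp_alphas[:5], path.ccp_alphas[-1])
```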

When using decision trees, I would therefore rely on the pruning parameter (ccp_alpha) to avoid overfitting, while keeping a "relaxed" value of max_depth or min_samples_leaf only as a cap on tree size.
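As a rough sketch of what that might look like in a grid search (the grid values and dataset are placeholders, not recommendations), you could keep a loose max_depth as the size cap and let the search over ccp_alpha do the actual regularization:

```python
# Hedged sketch: tune ccp_alpha via cross-validated grid search while
# keeping a deliberately relaxed max_depth.  Values are illustrative.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(max_depth=20, random_state=0),      # relaxed size cap
    param_grid={"ccp_alpha": [0.0, 0.001, 0.01, 0.1, 1.0]},   # pruning strength
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```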

For random forests, on the other hand, I would not use any pruning/stopping criterion unless you have memory constraints, as the algorithm usually works best with fully grown trees (this is mentioned both by Breiman in the original random forest paper and by Hastie/Tibshirani in their book, if you need references).
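In scikit-learn terms, that means leaving the forest at its defaults (max_depth=None, ccp_alpha=0.0) and only adding a mild size restriction if memory forces you to. A hedged sketch with illustrative values:

```python
# Sketch: fully grown trees are the scikit-learn defaults for a random forest;
# add a mild stopping rule only if memory becomes an issue.  Data is illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Default forest: trees are grown out fully (max_depth=None, ccp_alpha=0.0).
rf_full = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Memory-constrained variant: a small min_samples_leaf (or a ccp_alpha > 0)
# keeps individual trees smaller at some cost in flexibility.
rf_small = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=5, random_state=0
).fit(X, y)
```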