Random Forest – References for Number of Features in Random Forest Regression

Tags: cart, random-forest

The default number of features $m$ used when making splits in random forest regression is $m=p$ in Python's sklearn, where $p$ is the number of predictors in the regression problem (see the sklearn docs: "If 'auto', then max_features=n_features"). In R's randomForest package, the default number of features used when making splits in a regression problem is $m=p\,/\,3$ (see the randomForest docs, specifically the mtry argument to randomForest).
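
To make the two defaults concrete, here is a minimal sklearn sketch (synthetic data and settings chosen only for illustration) that sets max_features explicitly to each default:

```python
# A minimal, illustrative sketch (synthetic data, assumed settings) of how the two
# defaults map onto sklearn's max_features argument.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=12, noise=5.0, random_state=0)
p = X.shape[1]

# sklearn's default for regression: consider all p features at every split (m = p)
rf_sklearn_default = RandomForestRegressor(max_features=None, random_state=0)

# R's randomForest default for regression: m = floor(p / 3)
rf_r_default = RandomForestRegressor(max_features=max(1, p // 3), random_state=0)

for rf in (rf_sklearn_default, rf_r_default):
    rf.fit(X, y)
    print(rf.max_features, rf.score(X, y))
```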

This point is (briefly) discussed at https://github.com/scikit-learn/scikit-learn/issues/7254, where an sklearn contributor says that $m=p$ is recommended. I've seen the $m=p\,/\,3$ recommendation in several places (e.g. https://stackoverflow.com/questions/23939750/understanding-max-features-parameter-in-randomforestregressor and The Elements of Statistical Learning, https://web.stanford.edu/~hastie/Papers/ESLII.pdf).

My general understanding is that $m$ should always be tuned in random forest regression problems, and that the optimal $m$ could vary depending on the setting. Are there any references that discuss $m$ specifically in the context of regression (as opposed to classification)? Is there any well-founded reason to prefer either R's or sklearn's default value for $m$ (or should the answer be "don't use the default, always tune," so that the default doesn't matter)?

Best Answer

According to The Elements of Statistical Learning (Section 15.3):

  • The recommended default values are $m = p/3$ for regression problems and $m = \sqrt{p}$ for classification problems (attributed to Breiman).

  • The best value of $m$ depends on the problem, so $m$ should be treated as a tuning parameter (see the sketch after this list).
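
Since the takeaway of the list above is that $m$ is a tuning parameter, here is a minimal sketch of how one might tune it with cross-validation in sklearn; the synthetic dataset and the candidate grid are my own illustrative choices, not anything prescribed by ESL:

```python
# A hedged sketch of treating m as a tuning parameter via cross-validation.
# The dataset and the candidate grid are illustrative assumptions only.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=12, noise=5.0, random_state=0)
p = X.shape[1]

# Candidate values of m spanning the usual defaults up to m = p
param_grid = {"max_features": sorted({max(1, p // 3), int(p ** 0.5), p // 2, p})}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best m:", search.best_params_["max_features"])
```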

As discussed by Breiman (2001), the generalization error of a random forest decreases with the generalization error of the individual trees and with the correlation between the trees. Subsampling features is a way to decorrelate the trees. Increasing $m$ makes the individual trees more powerful, but also increases their correlation. The optimal value of $m$ achieves a tradeoff between these two opposing effects, and typically lies somewhere in the middle of the range. Breiman states that, in regression problems (compared to classification problems), the generalization error of the individual trees decreases more slowly with $m$, and the correlation between trees also increases more slowly, so a larger value of $m$ is needed. This would explain why the commonly recommended default values of $m$ scale faster with $p$ for regression problems than for classification problems. However, I didn't see an explicit recommendation for $m = p/3$ in that paper.
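
One way to see this trade-off on a concrete dataset is to sweep $m$ and watch the out-of-bag error, which serves as a cheap estimate of the generalization error. A rough sketch (synthetic data and settings chosen only for illustration):

```python
# Illustrative sketch: sweep m and track the out-of-bag R^2 to see where the
# strength-vs-correlation trade-off lands on one particular (synthetic) dataset.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=12, noise=5.0, random_state=0)
p = X.shape[1]

for m in range(1, p + 1):
    rf = RandomForestRegressor(
        n_estimators=300,
        max_features=m,
        bootstrap=True,   # OOB estimates require bootstrap sampling
        oob_score=True,
        random_state=0,
    )
    rf.fit(X, y)
    print(f"m = {m:2d}   OOB R^2 = {rf.oob_score_:.3f}")
```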

Regarding scikit-learn's default setting of $m=p$: As mentioned above, subsampling features is one of the essential properties of random forests, and it improves their performance by decorrelating the trees. Setting $m=p$ removes this benefit and makes the model equivalent to simple bagged trees; Breiman (2001) showed that this gives inferior performance on regression problems. Someone on the scikit-learn GitHub issue you linked claimed that $m=p$ is indeed recommended. However, the paper they cite is about 'extremely randomized trees', not standard random forests. Extremely randomized trees choose completely random split points, whereas random forests optimize the split point. As above, good performance requires a balance between strengthening the individual trees and injecting randomness to decorrelate them. Because extremely randomized trees inject a higher degree of randomness into the split points, it makes sense that they benefit from compensating by searching over more features to split on. Conversely, random forests fully optimize the split points, so it makes sense that they benefit from subsampling features to increase randomness.
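
To make the comparison above concrete, the sketch below (again on synthetic data, with arbitrary settings) pits a random forest with $m = p$ against explicitly bagged decision trees and extremely randomized trees; the first two should behave very similarly, since a random forest with $m = p$ is essentially bagging.

```python
# Illustrative comparison (synthetic data): a random forest with m = p, which
# should roughly coincide with plain bagged trees, versus extremely randomized
# trees, which keep m large but randomize the split points instead.
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    BaggingRegressor,
    ExtraTreesRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=12, noise=5.0, random_state=0)

models = {
    "random forest, m = p": RandomForestRegressor(
        n_estimators=200, max_features=None, random_state=0
    ),
    "bagged decision trees": BaggingRegressor(
        DecisionTreeRegressor(), n_estimators=200, random_state=0
    ),
    "extra-trees, m = p": ExtraTreesRegressor(
        n_estimators=200, max_features=None, random_state=0
    ),
}
for name, model in models.items():
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"{name}: mean CV MSE = {mse:.2f}")
```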

References:

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.