Solved – In Random Forest, why is a random subset of features chosen at the node level rather than at the tree level?

Tags: feature-selection, history, importance, machine-learning, random-forest

My Question: Why does random forest consider random subsets of features for splitting at the node level within each tree rather than at the tree level?

Background: This is something of a history question. In 1998, Tin Kam Ho published her paper on constructing "decision forests" by randomly selecting a subset of features to use for growing each tree (the random subspace method). Several years later, in 2001, Leo Breiman published his seminal Random Forests paper, in which the feature subset is randomly selected at each node within each tree, not once per tree. While Breiman cited Ho, he did not specifically explain the move from tree-level to node-level random feature selection.

I'm wondering what specifically motivated this development. It seems that selecting the feature subset at the tree level would still accomplish the desired decorrelation of the trees.

My theory: I haven't seen this articulated elsewhere, but it seems like the random subspace method would be less efficient at producing estimates of feature importance. To obtain these estimates, for each tree the features are randomly permuted one at a time, and the increase in misclassification rate or prediction error on the out-of-bag observations is recorded. The features whose random permutation produces the largest increase in error are those with the greatest importance.
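To make the procedure concrete, here is a minimal sketch of permutation importance in Python. One hedge: Breiman's original procedure permutes each feature over each tree's out-of-bag observations, whereas for brevity this sketch scores a single held-out set via scikit-learn's `permutation_importance`; the synthetic dataset and all parameter values are arbitrary illustrations.

```python
# Minimal permutation-importance sketch. Assumption: a held-out test set
# stands in for Breiman's per-tree out-of-bag samples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_tr, y_tr)

# Permute each feature in turn (20 repeats) and record the drop in accuracy.
result = permutation_importance(forest, X_te, y_te, n_repeats=20,
                                random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: mean accuracy drop {result.importances_mean[i]:.4f}")
```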

If we use the random subspace method, each tree considers only $m$ of the $p$ features, so it may take many trees before all $p$ predictors have been considered even once. On the other hand, if we draw a different subset of $m$ features at each node, every feature is considered many times across far fewer trees, giving us a more robust estimate of feature importance. The simulation sketch below illustrates the difference in coverage.
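Here is a small coverage simulation (my own illustration; the values of $p$, $m$, and the number of internal nodes per tree are arbitrary assumptions) counting how many trees each scheme needs before every feature has been offered as a split candidate at least once:

```python
# Coverage simulation: node-level vs tree-level feature subsampling.
# Assumptions: p = 10 features, m = 3 candidates per draw, and a tree
# with ~31 internal nodes (roughly a complete binary tree of depth 5).
import numpy as np

rng = np.random.default_rng(0)
p, m, nodes_per_tree = 10, 3, 31

def trees_until_full_coverage(per_node, max_trees=1000):
    """Trees grown before every feature has been a split candidate once."""
    seen = np.zeros(p, dtype=bool)
    for t in range(1, max_trees + 1):
        if per_node:
            for _ in range(nodes_per_tree):   # fresh subset at every node
                seen[rng.choice(p, size=m, replace=False)] = True
        else:                                 # one subset for the whole tree
            seen[rng.choice(p, size=m, replace=False)] = True
        if seen.all():
            return t
    return max_trees

node_level = [trees_until_full_coverage(True) for _ in range(200)]
tree_level = [trees_until_full_coverage(False) for _ in range(200)]
print("node-level: mean trees to cover all features:", np.mean(node_level))
print("tree-level: mean trees to cover all features:", np.mean(tree_level))
```

Under these assumptions, a single node-level tree almost always touches every feature, while tree-level sampling typically needs several times as many trees to achieve the same coverage.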

What I've looked at thus far: So far, I have read both Breiman's and Ho's papers and done a broad online search for comparisons of the two methods without finding a definitive answer. Note that a similar question was asked before; this question goes a bit further by including my speculation/work toward a possible solution. I would be interested in any answers, relevant citations, or simulation studies comparing the two approaches. If none are forthcoming, I plan to run my own simulation comparing the two methods.

Best Answer

Suppose we have 10 features f1, f2, ..., f9, f10. If we take a subset, say f1, f3, f4, f8, at the tree level, then the whole tree is constructed using only those 4 features.

At every node we evaluate the split criterion over only these 4 features and pick the one that yields the greatest information gain (i.e., the largest reduction in entropy). This is not of much use, because we have restricted the entire tree's learning to those 4 features. By contrast, suppose we take a subset of features, say f1, f8, f9, at the first node, compare the criterion among these 3 features, and choose the one with the highest gain. Instead of growing the tree further with the same features, we then draw another subset, say f4, f7, f2, and make the next split from among those. Suppose f8 was selected at the first node and f2 at the second node: the model can now learn the relationship between f8 and f2, which might never happen with a fixed tree-level subset, since some other feature in that subset could always beat f2 on information gain once f8 has been selected as the root.

In this way, the model can learn relationships between different features in a more diversified way. Many more features get explored within a single tree, so interactions among them can be captured. Hope that helps :)
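To make the contrast concrete, here is a toy sketch (my own illustration, not code from the original answer) of a tree grower in which a `per_node` flag decides whether the candidate subset is redrawn at every split, as in Breiman's Random Forests, or drawn once for the whole tree, as in Ho's random subspace method. Gini impurity stands in for the entropy criterion discussed above; the data and all parameters are arbitrary:

```python
# Toy tree grower contrasting node-level vs tree-level feature subsets.
import numpy as np

rng = np.random.default_rng(0)

def gini(y):
    """Gini impurity of a label vector (stand-in for entropy)."""
    _, counts = np.unique(y, return_counts=True)
    q = counts / counts.sum()
    return 1.0 - np.sum(q ** 2)

def best_split(X, y, candidates):
    """Best (feature, threshold) among candidate features by impurity drop."""
    best_f, best_t, best_score = None, None, gini(y)
    for f in candidates:
        for t in np.unique(X[:, f])[:-1]:    # thresholds leave both sides non-empty
            left = X[:, f] <= t
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t

def grow(X, y, m, per_node, tree_subset=None, depth=0, max_depth=3):
    if depth == max_depth or len(np.unique(y)) == 1:
        return {"leaf": int(np.bincount(y).argmax())}
    # The key difference: redraw the candidate subset here (node level)
    # or reuse the one drawn once for the whole tree (tree level).
    candidates = (rng.choice(X.shape[1], size=m, replace=False)
                  if per_node else tree_subset)
    f, t = best_split(X, y, candidates)
    if f is None:                            # no impurity-reducing split found
        return {"leaf": int(np.bincount(y).argmax())}
    left = X[:, f] <= t
    return {"feature": int(f), "threshold": float(t),
            "left": grow(X[left], y[left], m, per_node, tree_subset,
                         depth + 1, max_depth),
            "right": grow(X[~left], y[~left], m, per_node, tree_subset,
                          depth + 1, max_depth)}

X = rng.normal(size=(200, 10))
y = (X[:, 1] + X[:, 7] > 0).astype(int)      # label depends on f2 AND f8
node_tree = grow(X, y, m=3, per_node=True)
subset = rng.choice(10, size=3, replace=False)   # fixed subset, random subspace style
tree_tree = grow(X, y, m=3, per_node=False, tree_subset=subset)
print("tree-level subset drawn:", subset)
print("node-level root split:", node_tree.get("feature", "leaf"))
print("tree-level root split:", tree_tree.get("feature", "leaf"))
```

Unless the fixed subset happens to contain both f2 (index 1) and f8 (index 7), the tree-level tree can never model their interaction, whereas the node-level tree gets a fresh chance at every split.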