Random Forest – Detailed Motivation Behind Random Forest Algorithm Steps

Tags: classification, machine-learning, random-forest

The method that I'm familiar with for constructing a random forest is as follows:
(from http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)

To build a tree in the forest we:

  1. Draw a bootstrap sample of size N, where N is the size of our training set. Use this bootstrapped sample as the training set for this tree.
  2. At each node of the tree, randomly select m of our M features and split on the best of these m features (where m is a parameter of our Random Forest).
  3. Grow each tree to the largest extent possible, i.e. no pruning. (A rough code sketch of these three steps follows below.)

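For concreteness, here is a minimal sketch of the three steps above (not Breiman's reference implementation): it assumes a numeric feature matrix X and non-negative integer class labels y, and it delegates the per-node random feature selection of step 2 to scikit-learn's DecisionTreeClassifier via its max_features parameter.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, m=None, rng=None):
    """Grow a forest of unpruned trees, one bootstrap sample per tree."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample of size N (sampling with replacement).
        idx = rng.integers(0, n, size=n)
        # Step 2: max_features=m makes the tree evaluate only a random
        # subset of m features at every node.
        # Step 3: max_depth=None grows each tree out fully (no pruning).
        tree = DecisionTreeClassifier(max_features=m, max_depth=None)
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def forest_predict(forest, X):
    # Majority vote over the individual trees (assumes integer labels).
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

With m around sqrt(M) this behaves roughly like the standard classification random forest; setting m = M removes step 2 and reduces the procedure to bagging.
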
While this algorithm makes sense at a procedural level and certainly produces good results, I'm not clear on the theoretical motivation behind steps 1, 2, and 3. Could someone explain what motivated someone to come up with this procedure and why it works so well?

For example: why do we need to perform step 1? It doesn't seem like we're bootstrapping for its usual purpose of variance reduction.

Best Answer

Ensemble methods (such as random forests) require some element of variation in the datasets that the individual base classifiers are grown on; otherwise you would end up with a forest of trees that are all too similar. As decision trees are highly sensitive to the observations in the training set, varying the observations (using the bootstrap) was, I suppose, a natural approach to getting the required diversity. The obvious alternative is to vary the features that are used, e.g. train each tree on a subset of the original features. Using the bootstrap samples also lets us estimate the out-of-bag (OOB) error rate and variable importance.
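As a side note on why the bootstrap yields OOB estimates essentially for free: each bootstrap sample of size N leaves out roughly a third of the observations, and those left-out points act as a built-in test set for that tree. A quick numerical check (sample size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
idx = rng.integers(0, n, size=n)           # one bootstrap sample of size N
oob = np.setdiff1d(np.arange(n), idx)      # observations this tree never saw
print(len(oob) / n)                        # about 0.368, since (1 - 1/N)^N -> 1/e
```

Averaging each observation's prediction over only the trees for which it was out-of-bag gives the OOB error estimate without needing a separate validation set.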

Step 2 is essentially another way of injecting randomness into the forest. Using a low mtry value reduces the correlation among the trees, with the trade-off of (potentially) weakening the predictive power of each tree. Using too large a value of mtry will cause the trees to become increasingly similar to one another (and in the extreme you end up with bagging).
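One rough way to see this trade-off empirically (synthetic data and parameter values are purely illustrative): measure how often pairs of trees agree on held-out points for a small versus a maximal max_features (scikit-learn's name for mtry).

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, random_state=0)
X_tr, y_tr, X_te = X[:1500], y[:1500], X[1500:]

for m in (2, 50):  # small mtry vs. mtry = M, which reduces to bagging
    rf = RandomForestClassifier(n_estimators=50, max_features=m,
                                random_state=0).fit(X_tr, y_tr)
    preds = [t.predict(X_te) for t in rf.estimators_]
    agree = np.mean([np.mean(a == b) for a, b in combinations(preds, 2)])
    print(f"max_features={m}: mean pairwise tree agreement {agree:.3f}")
```

Lower max_features should give noticeably lower agreement (less correlated trees), at the cost of each individual tree being a weaker predictor.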

I believe that the reason for not pruning is more that it's not necessary than anything else. With a single decision tree you would normally prune it, since it's highly susceptible to overfitting. However, by using bootstrap samples and growing many trees, a random forest can grow trees that are individually strong but not particularly correlated with one another. Basically, the individual trees are overfit, but provided their errors are not correlated, the forest should be reasonably accurate.
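A hedged sanity check of that claim (again with synthetic data and arbitrary settings): a single fully grown tree fits the training set almost perfectly yet generalises worse than a forest of equally unpruned trees.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)        # one unpruned tree
forest = RandomForestClassifier(n_estimators=200,
                                random_state=1).fit(X_tr, y_tr)      # many unpruned trees

print("single tree train/test:", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("forest      train/test:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
```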

The reason it works well is similar to Condorcet's jury theorem (and the logic behind methods such as boosting). Basically, you have lots of weak learners that only need to perform marginally better than random guessing. If this holds, you can keep adding weak learners, and in the limit you would get perfect predictions from your ensemble. In practice this is limited by the errors of the learners becoming correlated, which prevents the ensemble's performance from improving further.
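A small worked example of that jury-theorem logic (the numbers are purely illustrative): if each of n independent voters is correct with probability 0.55, the accuracy of the majority vote climbs towards 1 as n grows.

```python
from math import comb

def majority_accuracy(n, p):
    """P(majority of n independent voters is correct), each correct with prob. p (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 11, 101, 1001):
    print(n, round(majority_accuracy(n, 0.55), 3))
# Accuracy rises from 0.55 towards 1.0, but only because the voters' errors
# are assumed independent; correlated trees violate that assumption.
```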