Solved – Why does a bagged tree / random forest tree have higher bias than a single decision tree?

Tags: bagging, bias, cart, random-forest, variance

If we consider a fully grown decision tree (i.e. an unpruned decision tree), it has high variance and low bias.

Bagging and Random Forests use these high-variance models and aggregate them in order to reduce variance and thus enhance prediction accuracy. Both Bagging and Random Forests use bootstrap sampling, and as described in "The Elements of Statistical Learning", this increases the bias of each single tree.

Furthermore, since the Random Forest method limits the variables allowed as split candidates at each node, the bias of a single random forest tree is increased even more.

Thus, prediction accuracy is increased only if the increase in bias of the individual trees in Bagging and Random Forests does not overshadow the variance reduction.

This leads me to the following two questions:
1) I know that bootstrap sampling (almost always) puts repeated observations into the bootstrap sample. But why does this lead to an increase in the bias of the individual trees in Bagging / Random Forests?
2) Furthermore, why does limiting the variables available at each split lead to higher bias in the individual trees in Random Forests?

Best Answer

I will accept the answer on 1) from Kunlun, but just to close this case, I will give here the conclusions on the two questions that I reached in my thesis (both of which were accepted by my supervisor):

1) More data produces better models, and since we only use part of the whole training data to train the model (bootstrap), higher bias occurs in each tree (copied from the answer by Kunlun). A bootstrap sample of size n contains, on average, only about 63.2% of the distinct training observations, so each tree effectively sees less information than the full training set provides.
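To illustrate the 63.2% figure, which is a general property of bootstrap sampling (the expected fraction of distinct observations is 1 - (1 - 1/n)^n, which tends to 1 - 1/e), here is a minimal numpy sketch that draws one bootstrap sample and counts the distinct observations it contains; the sample size is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # illustrative training-set size

# One bootstrap sample: n draws with replacement from the n observations.
sample = rng.integers(0, n, size=n)
unique_fraction = np.unique(sample).size / n

# Expected fraction of distinct observations: 1 - (1 - 1/n)^n -> 1 - 1/e ~ 0.632
print(f"{unique_fraction:.3f}")
```

The remaining ~36.8% of the draws are duplicates, which is why each individual tree is trained on less information than the full training set holds.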

2) In the Random Forests algorithm, we limit the number of variables to split on at each split, i.e. we limit the number of variables available to explain the data. Again, higher bias occurs in each tree.
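One way to see the effect of the split-variable restriction: if there are p variables and only m of them are offered at a given split (the `max_features` parameter in scikit-learn, often called mtry in R's randomForest), then any particular variable, including the most informative one, is available with probability m/p. A minimal numpy simulation of that availability; treating variable 0 as "the informative one" is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 10, 3          # p variables in total, m offered at each split (mtry)
trials = 200_000

# For each simulated split, pick a uniform random m-subset of the p variables.
u = rng.random((trials, p))
candidates = np.argpartition(u, m, axis=1)[:, :m]

# How often is variable 0 (standing in for the most informative one) offered?
availability = (candidates == 0).any(axis=1).mean()
print(availability)   # concentrates around m / p = 0.3
```

So with these illustrative numbers, roughly 70% of splits cannot use the best variable at all and must settle for a weaker split, which is exactly the extra bias in each individual tree.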

Conclusion: Both situations limit our ability to explain the population: first we limit the number of observations, then we limit the number of variables available at each split. Both limitations lead to higher bias in each tree, but the variance reduction in the ensemble often outweighs the bias increase in each tree, and thus Bagging and Random Forests tend to produce a better model than a single decision tree.
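The trade-off in the conclusion can be made concrete with the standard formula for the variance of an average of B identically distributed tree predictions with per-tree variance sigma^2 and pairwise correlation rho: Var(average) = rho*sigma^2 + (1 - rho)*sigma^2 / B. A numpy sketch with purely illustrative values for B, sigma^2 and rho:

```python
import numpy as np

rng = np.random.default_rng(2)
B, sigma2, rho = 100, 1.0, 0.5   # trees, per-tree variance, pairwise correlation

# Equicorrelated covariance matrix of the B tree predictions.
cov = sigma2 * (rho * np.ones((B, B)) + (1.0 - rho) * np.eye(B))
preds = rng.multivariate_normal(np.zeros(B), cov, size=50_000)

single_var = preds[:, 0].var()          # close to sigma2 = 1.0
ensemble_var = preds.mean(axis=1).var()
# Theory: rho*sigma2 + (1 - rho)*sigma2/B = 0.5 + 0.5/100 = 0.505
print(single_var, ensemble_var)
```

Even with substantial correlation between trees, the variance of the averaged prediction drops to roughly half that of a single tree here, which is why a modest bias increase per tree can still leave the ensemble better off. The rho term also shows why Random Forests restrict the split variables in the first place: decorrelating the trees lowers the floor that averaging can reach.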