Number of Samples per Tree in a Random Forest

Tags: machine learning, random forest, sample-size, sampling, scikit-learn

How many samples does each tree of a random forest use for training in scikit-learn's implementation of Random Forest Regression?
And how does the number of samples change when the bootstrap option is on compared to when it’s off?

In “Random Forests” by Breiman, I believe he mentions that each tree is trained on 1/3 of the data. Is that also the case in scikit-learn's implementation?

sklearn ref: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

github source: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/ensemble/forest.py#L1019

Best Answer

I am answering my own question. I got a chance to talk to the people who implemented the random forest in scikit-learn. Here is their explanation:

"If bootstrap=False, then each tree is built on all training samples.

If bootstrap=True, then for each tree, N samples are drawn randomly with replacement from the training set and the tree is built on this new version of the training data. This introduces randomness in the training procedure since trees will each be trained on slightly different training sets. In expectation, drawing N samples with replacement from a dataset of size N will select ~2/3 unique samples from the original set."
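To make the "~2/3" figure concrete: a given sample is missed by all N draws with probability (1 - 1/N)^N, so the expected fraction of unique samples is 1 - (1 - 1/N)^N, which approaches 1 - 1/e ≈ 0.632 as N grows. The remaining ~0.368 are the out-of-bag samples for that tree, which may be where the "1/3" in the question comes from. Below is a minimal sketch that checks this by simulation with NumPy only; it does not call scikit-learn, and the function name unique_fraction and the values of n, n_trials and seed are illustrative assumptions.

```python
import numpy as np


def unique_fraction(n_samples, n_trials=200, seed=0):
    """Average fraction of distinct rows in a bootstrap sample of size n_samples.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_trials):
        # Draw n_samples row indices with replacement, as a bootstrap does.
        idx = rng.integers(0, n_samples, size=n_samples)
        fractions.append(np.unique(idx).size / n_samples)
    return float(np.mean(fractions))


n = 10_000
simulated = unique_fraction(n)
# Closed form: 1 - P(a given row is never drawn in n draws).
exact = 1.0 - (1.0 - 1.0 / n) ** n
limit = 1.0 - np.exp(-1.0)

print(f"simulated: {simulated:.4f}")
print(f"exact:     {exact:.4f}")
print(f"limit:     {limit:.4f}")
```

All three numbers should agree to about two decimal places, i.e. each bootstrapped tree sees N rows in total but only about 63% of the distinct training rows.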

Since scikit-learn v0.22, you can keep bootstrapping but control how many samples are drawn for each tree with the max_samples parameter of the RandomForestRegressor class.
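The three configurations discussed above can be sketched as follows. This is an illustrative example on synthetic data, assuming scikit-learn 0.22 or later; the dataset sizes, n_estimators and random_state values are arbitrary choices, and the per-tree sample counts in the comments follow from the explanation above rather than from inspecting the fitted trees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, purely for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# bootstrap=False: every tree is fit on all 200 training rows.
full = RandomForestRegressor(
    n_estimators=10, bootstrap=False, random_state=0
).fit(X, y)

# bootstrap=True (the default): each tree is fit on 200 rows drawn
# with replacement, so roughly 63% of the distinct rows.
boot = RandomForestRegressor(
    n_estimators=10, bootstrap=True, random_state=0
).fit(X, y)

# bootstrap=True with max_samples=0.5: each tree is fit on about
# 0.5 * 200 = 100 rows drawn with replacement.
sub = RandomForestRegressor(
    n_estimators=10, bootstrap=True, max_samples=0.5, random_state=0
).fit(X, y)

for name, model in [("full", full), ("boot", boot), ("sub", sub)]:
    print(name, len(model.estimators_), model.predict(X[:3]))
```

max_samples also accepts an integer, in which case it is taken as the absolute number of rows to draw per tree rather than a fraction.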

Excellent sources on this subject for more details:

Why on average does each bootstrap sample contain roughly two thirds of observations?
Louppe, Gilles. "Understanding random forests: From theory to practice." arXiv preprint arXiv:1407.7502 (2014).
Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.
Breiman, Leo, et al. Classification and regression trees. Routledge, 2017.
Explaining to laypeople why bootstrapping works
