Solved – Number of Samples per Tree in a Random Forest

Tags: machine learning, random forest, sample-size, sampling, scikit-learn

  1. How many samples does each tree of a random forest use for training in scikit-learn's implementation of Random Forest Regression?

  2. And, how does the number of samples change when the bootstrap option is on compared to when it’s off?

In “Random Forests” by Breiman, I believe he mentions that each tree is trained on 1/3 of the data. Is that the case in the scikit-learn implementation as well?

sklearn ref: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

github source: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/ensemble/forest.py#L1019

Best Answer

I am answering my own question. I got a chance to talk to the people who implemented the random forest in scikit-learn. Here is the explanation:

"If bootstrap=False, then each tree is built on all training samples.

If bootstrap=True, then for each tree, N samples are drawn randomly with replacement from the training set and the tree is built on this new version of the training data. This introduces randomness in the training procedure since trees will each be trained on slightly different training sets. In expectation, drawing N samples with replacement from a dataset of size N will select ~2/3 unique samples from the original set. "
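The ~2/3 figure quoted above can be checked with a quick simulation: drawing N indices with replacement and counting how many are unique converges on 1 - 1/e ≈ 0.632. This is a standalone sketch, not scikit-learn's internal code:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000  # training-set size, chosen arbitrarily for illustration

# One bootstrap sample for a tree: N draws with replacement.
bootstrap_indices = rng.integers(0, N, size=N)
unique_fraction = np.unique(bootstrap_indices).size / N

# Expected unique fraction is 1 - (1 - 1/N)^N, which tends to 1 - 1/e ≈ 0.632.
print(f"unique fraction: {unique_fraction:.3f}")
```

The remaining ~1/3 of samples not drawn for a given tree are its "out-of-bag" samples, which is where Breiman's 1/3 figure comes from.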

Since scikit-learn v0.22, you can still use bootstrapping but limit the maximum number of samples each tree is trained on (the max_samples parameter of the RandomForestRegressor class).
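As a minimal sketch of that parameter (the data here is synthetic and chosen only for illustration), passing a float to max_samples caps each tree's bootstrap sample at that fraction of the training set:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# bootstrap=True is the default; max_samples=0.5 means each tree's
# bootstrap sample has 0.5 * 200 = 100 draws (with replacement)
# instead of the full 200.
forest = RandomForestRegressor(
    n_estimators=10, bootstrap=True, max_samples=0.5, random_state=0
)
forest.fit(X, y)
```

With bootstrap=False, max_samples is ignored and every tree sees the whole training set, matching the quoted explanation above.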
