Solved – What happens when bootstrapping isn’t used in sklearn.RandomForestClassifier

machine learning, python, random forest, scikit learn

I've been optimizing a random forest model built from the sklearn implementation. One of the parameters in this implementation of random forests lets you set bootstrap=True/False. While tuning the hyperparameters of my model to my dataset, both random search and genetic algorithms consistently find that setting bootstrap=False results in a better model (accuracy increases by >1%). I am using 3-fold CV AND a separate test set at the end to confirm all of this. Tuned models consistently get me to ~98% accuracy. The dataset is a few thousand examples large and is split between two classes.
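
For reference, a minimal sketch of that kind of search in sklearn (the dataset and parameter ranges below are just placeholders, not my actual data or grid):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Stand-in data; my real dataset is a text corpus with far more features than rows
X, y = make_classification(n_samples=3000, n_features=200, n_informative=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [None, 10, 30],
    "bootstrap": [True, False],  # the parameter in question
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,
    cv=3,            # 3-fold CV, as described above
    random_state=0,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy on the held-out test set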

My question is this: is a random forest even still random if bootstrapping is turned off? I thought the whole premise of a random forest is that, unlike a single decision tree (which sees the entire dataset as it grows), RF randomly partitions the original dataset and divvies the partitions up among several decision trees. If bootstrapping is turned off, doesn't that mean you just have n decision trees growing from the same original data corpus? Or is it the case that when bootstrapping is off, the dataset is uniformly split into n partitions and distributed to n trees in a way that isn't randomized?

In addition, it doesn't make sense to me that taking away the main source of randomness in the algorithm would improve accuracy.


Note: I did a quick test with a randomly generated dataset, and setting bootstrap=False again gave better results.

from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic dataset with a large number of features relative to samples
X, y = make_classification(n_samples=10000, n_features=4000, n_informative=3000, n_redundant=600, random_state=0, shuffle=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Two forests that differ only in the bootstrap setting
clf_True = rfc(bootstrap=True, random_state=0)
clf_False = rfc(bootstrap=False, random_state=0)

clf_True.fit(X_train, y_train)
clf_False.fit(X_train, y_train)

scoreTrue = clf_True.score(X_test, y_test)
scoreFalse = clf_False.score(X_test, y_test)

>>>scoreTrue =  0.5232; scoreFalse = 0.5336

What is going on under the hood?

Edit: I made the number of features high in the example script above because in the dataset I'm working with (a large text corpus), I have hundreds of thousands of unique terms and only a few thousand training/testing instances. I believe bootstrapping leaves roughly a third of the examples out of each tree's training sample on average. Could it be that disabling bootstrapping is giving me better results because my training phase is data-starved?
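
A quick sanity check on that ~1/3 figure (n below is just an illustrative value):

# Probability that a given example appears at least once in a bootstrap sample
# of size n drawn with replacement: 1 - (1 - 1/n)**n  ->  1 - 1/e ≈ 0.632
n = 5000  # illustrative; "a few thousand" examples, as in my dataset
print(1 - (1 - 1/n) ** n)  # ≈ 0.632, i.e. ~37% of examples are left out of each tree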

Best Answer

My question is this: is a random forest even still random if bootstrapping is turned off?

Yes, it's still random. Without bootstrapping, all of the data is used to fit every tree, so there is no random variation between trees with respect to which examples they are trained on. However, random forest has a second source of randomness: the random subset of features considered at each split.
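
To make that concrete, here is a small sketch (toy data and parameter values chosen only for illustration): with bootstrap=False every tree is fit on identical rows, but because only max_features candidate features are examined at each split, the fitted trees still differ.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# No bootstrapping: every tree sees exactly the same training rows
forest = RandomForestClassifier(
    n_estimators=5, bootstrap=False, max_features="sqrt", random_state=0
).fit(X, y)

# The trees still differ, because each split only considered a random
# subset of about sqrt(20) ≈ 4 features
print([t.tree_.feature[0] for t in forest.estimators_])   # root split feature typically varies
print([t.tree_.node_count for t in forest.estimators_])   # tree sizes typically vary too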

I thought the whole premise of a random forest is that, unlike a single decision tree (which sees the entire dataset as it grows), RF randomly partitions the original dataset and divvies the partitions up among several decision trees.

This is incorrect. Random forest bootstraps the data for each tree, and then grows a decision tree that can only use a random subset of features at each split. The documentation states "The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default)," which implies that bootstrap=False draws a sample of size equal to the number of training examples without replacement, i.e. the same full training set is used for every tree.

Detailed explanations of the random forest procedure and its statistical properties can be found in Leo Breiman, "Random Forests," Machine Learning 45(1), 2001, as well as the relevant chapter of Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.

We can verify that this behavior exists specifically in the sklearn implementation by examining the source, which shows that the original data is not further altered when bootstrap=False:

    if forest.bootstrap:
        ...irrelevant...
    elif class_weight == 'balanced_subsample':
        ...irrelevant...
    else:
        tree.fit(X, y, sample_weight=sample_weight, check_input=False)

If bootstrapping is turned off, doesn't that mean you just have n decision trees growing from the same original data corpus?

Yes, with the understanding that only a random subset of features is considered at each split. In sklearn, random forest is implemented as an ensemble of sklearn.tree.DecisionTreeClassifier instances, which implement the randomized feature subsampling (via the max_features parameter).
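
As a rough complement to the sketch above (again with toy data): switching off both sources of randomness at once, bootstrap=False and max_features=None, gives every tree the same rows and all features at every split, so the individual DecisionTreeClassifier instances come out essentially identical, up to tie-breaking between equally good splits.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Both sources of randomness removed: same rows for every tree, all features at every split
forest = RandomForestClassifier(
    n_estimators=5, bootstrap=False, max_features=None, random_state=0
).fit(X, y)

print(type(forest.estimators_[0]).__name__)              # DecisionTreeClassifier
print({t.tree_.node_count for t in forest.estimators_})  # typically a single value: the trees are effectively identical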

Or is it the case that when bootstrapping is off, the dataset is uniformly split into n partitions and distributed to n trees in a way that isn't randomized?

No.
