Solved – Scalable Random Forest For Massive Data

bagging, bootstrap, ensemble learning, machine learning, random forest

My problem is simple. I want to train a random forest on a huge dataset with $n$ rows. Let's assume I can only fit $b < n$ rows in memory at a time.

Model Choice

I see several options:

Option #1: Bag of Decision Trees

Draw a subsample of $b$ rows from the total dataset (w/o replacement). Fit a decision tree on the subsample. Repeat $s$ times. Average the results of the estimators.
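For concreteness, here is a minimal sketch of option #1 for a regression task, assuming scikit-learn and a hypothetical `load_rows(indices)` helper that reads only the requested rows from disk, so at most $b$ rows are in memory at once:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_pasted_trees(n, b, s, load_rows, random_state=0):
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(s):
        idx = rng.choice(n, size=b, replace=False)   # b row indices, drawn w/o replacement
        X_sub, y_sub = load_rows(idx)                # hypothetical helper: reads only these rows
        trees.append(DecisionTreeRegressor().fit(X_sub, y_sub))
    return trees

def predict_pasted(trees, X_new):
    # average the predictions of the s trees
    return np.mean([tree.predict(X_new) for tree in trees], axis=0)
```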

Option #2: Bag of Bagged Decision Trees

Same as above but instead of a decision tree, apply bagging to the subsample.

Option #3: Bag of Random Forests

Same as above, but apply a random forest instead.
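Options #2 and #3 only change what is fit on each $b$-row subsample. A rough sketch of the two base ensembles (again assuming scikit-learn); either one would replace the single `DecisionTreeRegressor()` in the loop above, and the final prediction is still the average over the $s$ sub-ensembles:

```python
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# option #2: a bagged ensemble of trees fit on the b-row subsample
bagged_trees = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50)

# option #3: a random forest fit on the b-row subsample
small_forest = RandomForestRegressor(n_estimators=50)
```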

Option #4: Forest of Random Forests

Take my dataset, split it into several subsets based on a few features (either via unsupervised learning or using some prior knowledge), and train a random forest on each subset.

I originally suggested the idea here.
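A rough sketch of option #4, assuming scikit-learn's k-means for the unsupervised split and the same kind of hypothetical out-of-core helpers as above (`load_columns` and `load_rows` are illustrative, not a real API); it also assumes each partition fits in memory:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

def fit_forest_of_forests(n, k, feature_ids, load_columns, load_rows):
    # cluster on a handful of features cheap enough to hold for all n rows
    Z = load_columns(feature_ids)                  # shape (n, len(feature_ids))
    labels = KMeans(n_clusters=k).fit_predict(Z)
    forests = {}
    for c in range(k):
        idx = np.flatnonzero(labels == c)          # rows assigned to this partition
        X_c, y_c = load_rows(idx)                  # assumes the partition fits in memory
        forests[c] = RandomForestRegressor(n_estimators=100).fit(X_c, y_c)
    return forests  # at predict time, route each new row to its partition's forest
```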

Option #5, #6, & #7: Bag of X with Resampling

Take option #1, #2, or #3, and instead of fitting an estimator on the subsample, resample a new subset from the subsample w/ replacement so that the new subset has $n$ rows.

original subset (n rows) => subsample (b rows) => resample (n rows)

Then apply the estimator to the new subset.

For options #6 and #7, I would resample anew for each iteration of the ensemble loop.
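To make the resampling step concrete: literally materializing an $n$-row resample would defeat the memory constraint, so the sketch below uses the weighting trick from the Bag of Little Bootstraps, which is equivalent to resampling the subsample up to $n$ rows (scikit-learn assumed; names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_on_resampled_subsample(X_sub, y_sub, n, random_state=0):
    rng = np.random.default_rng(random_state)
    b = X_sub.shape[0]
    # how many of the n "resampled" rows land on each of the b subsample rows
    counts = rng.multinomial(n, np.full(b, 1.0 / b))
    # weighting by these counts is equivalent to fitting on the n-row resample
    return DecisionTreeRegressor().fit(X_sub, y_sub, sample_weight=counts)
```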

Discussion

Options #1, #2, and #3 are variants of pasting.
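As a point of reference when the data does fit in memory, scikit-learn's `BaggingRegressor` with `bootstrap=False` implements pasting directly; this is only a baseline sketch for comparing options #1–#3 against, not a fix for the memory constraint:

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

pasting = BaggingRegressor(
    DecisionTreeRegressor(),
    n_estimators=50,      # s in the notation above
    max_samples=10_000,   # b rows per tree
    bootstrap=False,      # drawn without replacement, i.e. pasting
)
# pasting.fit(X, y)  # requires all of X, y in memory
```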

Option #1 is the most computationally efficient of the options, but I'm worried that its accuracy is highly dependent on the choice of $b$. Option #5 might be better, but not by much.

Options #6 and #7 are similar to the Bag of Little Bootstraps (which you can learn more about here).

Question

Of the options above, which is the best method?

My guess is that option #7 is the best.

Thoughts? Comments?

Thanks everyone!

Update

I found this blog post. Looks like they do option #3, mention option #1, and allude to option #7.

Update 2

Best is obviously subjective. Model accuracy (test MSE) is important, but I'm willing to sacrifice some accuracy for greater computational efficiency.

Best Answer

There was a PhD thesis on random forests that included RF on large datasets; it is available at https://github.com/glouppe/phd-thesis. I no longer remember the proposed solution, but I recall that it included comparative experiments on different alternatives.