Solved – Fitting a random forest classifier on a large dataset

large datapythonrandom forest

I am currently trying to fit a binary random forest classifier on a large dataset (30+ million rows, 200+ features, in the 25 GB range) in order to variable importance analysis, but I am failing due to memory problems. I was hoping someone here could be of help with possible techniques, alternative solutions, and best practices to do this.

Very appreciated would be:

  1. How to make my approach described below actually work.
  2. If not possible, alternative libraries/methods to do the same thing (possibly working on a dask dataframe). Here I guess maybe tensorflow is a possibility (I haven't tried yet).
  3. If still not possible, alternative approaches to variable importance that can be scaled to very large datasets.

Details

I am reading my dataset using dask.dataframe from a parquet (since anyways the data don't fit in memory). As a model I use sklearn.ensemble.RandomForestClassifier. Additionally, I am playing around with dask.distributed with joblib.parallel_backend('dask').

My hope was that this would exploit dask in order to avoid going over memory, but it doesn't seem to be the case. Here is my code (dataset-specific details omitted):

import dask.dataframe as dd

from sklearn.ensemble import RandomForestClassifier

from dask.distributed import Client
import joblib

# load dask dataframe with the training sample
ddf = dd.read_parquet('my_parquet_file'),
                      index=False)

features = [...]

# random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=16,
                                       criterion='entropy',
                                       n_jobs=-1,
                                       random_state=543,
                                       verbose=True)

with Client(processes=False) as client:
    with joblib.parallel_backend('dask'):
        rf_classifier.fit(ddf[features], ddf['response'])

What I get are a ton of warnings of this form:

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 11.95 GB -- Worker memory limit: 17.03 GB

And then at the end an error:

 File "C:\Users\Daniel\Documents\GitHub\PIT-TTC-PD\Hyperparameter 

estimation\random_forest_variable_importance.py", line 51, in <module>
    rf_classifier.fit(ddf[features], ddf['response'])

  File "C:\Users\Daniel\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 295, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)

  File "C:\Users\Daniel\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 531, in check_array
    array = np.asarray(array, order=order, dtype=dtype)

  File "C:\Users\Daniel\anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Daniel\anaconda3\lib\site-packages\dask\dataframe\core.py", line 366, in __array__
    x = np.array(self._computed)

  File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\generic.py", line 1909, in __array__
    return com.values_from_object(self)

  File "pandas\_libs\lib.pyx", line 81, in pandas._libs.lib.values_from_object

  File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\generic.py", line 5487, in values
    return self._data.as_array(transpose=self._AXIS_REVERSED)

  File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 830, in as_array
    arr = mgr._interleave()

  File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 848, in _interleave
    result = np.empty(self.shape, dtype=dtype)

MemoryError: Unable to allocate 60.3 GiB for an array with shape (267, 30335674) and data type float64

I tried:

  • Playing around with the classifier's parameters (eg setting bootstrap=True and max_samples at a low number, thinking that it would only draw a small number of observation at each step, or setting a low max_depth) but to no avail.
  • Playing around with the Clients parameters, but also without favorable results.

I know I could simply do this on a subsample of the data if nothing works, but I also want to understand how to make this kind of methods work on very large samples, so any help with this would be immensely appreciated.

Best Answer

To fit so much data, you have to use subsamples, for instance tensorflow you sub-sample at each step (using only one batch) and algorithmically speaking you only load one batch at a time in memory it is why it works. Most of the time this is done using a generator instead of the dataset right away. Your problem is that you always load the whole dataset in memory.

To use sub-samples without loading the whole dataset with Random forest, I don't think it is doable using scikit-learn without re-coding part of the library. On the other hand, you can use xgboost and manually do the training part. Here is an example in classification, you can adapt the loss to get an example in regression.

import numpy as np

import xgboost as xgb
from sklearn.datasets import make_blobs
import pandas as pd

# Construct dataset in 1D, dumped in a csv for illustration purpose
X, y = make_blobs(centers= [[0,0], [1,2]],n_samples=10020)
df = pd.DataFrame()
df['feature1']=X[:,0]
df['feature2']=X[:,1]
df['label'] = y.ravel()
features = ['feature1','feature2']

df.to_csv('big_dataset.csv')

# Construct a generator from a csv file. Read chunck of 1000 lines
gen_data = pd.read_csv('big_dataset.csv', chunksize=1000)

class make_model():
    def __init__(self,param,num_round=300):
        self.param=param
        self.num_round=num_round
    def fit(self,gen_data):
        iteration = 0
        
        for df in gen_data:
            dtrain = xgb.DMatrix(np.array(df[features]), label=df['label'])
            if iteration ==0:
                model = xgb.Booster(self.param, [dtrain])
            model = xgb.train(self.param,dtrain,num_boost_round=1, xgb_model=model)
            iteration += 1
            
        self.model_=model
    def predict(self,X):
        dtest=xgb.DMatrix(X)
        return self.model_.predict(dtest)>0.5 # use argmax in non-binary classification
parameters = {'max_depth':5, "booster":"gbtree"} # parameters to tune, see xgboost doc. Can be used to make boosted trees or Random Forests.
model = make_model(parameters) 
model.fit(gen_data)
xgb.plot_importance(model.model_)