Solved – Fitting a random forest classifier on a large dataset

large datapythonrandom forest

I am currently trying to fit a binary random forest classifier on a large dataset (30+ million rows, 200+ features, in the 25 GB range) in order to variable importance analysis, but I am failing due to memory problems. I was hoping someone here could be of help with possible techniques, alternative solutions, and best practices to do this.

Very appreciated would be:

How to make my approach described below actually work.
If not possible, alternative libraries/methods to do the same thing (possibly working on a dask dataframe). Here I guess maybe tensorflow is a possibility (I haven't tried yet).
If still not possible, alternative approaches to variable importance that can be scaled to very large datasets.

Details

I am reading my dataset using dask.dataframe from a parquet (since anyways the data don't fit in memory). As a model I use sklearn.ensemble.RandomForestClassifier. Additionally, I am playing around with dask.distributed with joblib.parallel_backend('dask').

My hope was that this would exploit dask in order to avoid going over memory, but it doesn't seem to be the case. Here is my code (dataset-specific details omitted):

import dask.dataframe as dd

from sklearn.ensemble import RandomForestClassifier

from dask.distributed import Client
import joblib

# load dask dataframe with the training sample
ddf = dd.read_parquet('my_parquet_file'),
                      index=False)

features = [...]

# random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=16,
                                       criterion='entropy',
                                       n_jobs=-1,
                                       random_state=543,
                                       verbose=True)

with Client(processes=False) as client:
    with joblib.parallel_backend('dask'):
        rf_classifier.fit(ddf[features], ddf['response'])

What I get are a ton of warnings of this form:

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 11.95 GB -- Worker memory limit: 17.03 GB

And then at the end an error:

 File "C:\Users\Daniel\Documents\GitHub\PIT-TTC-PD\Hyperparameter 

estimation\random_forest_variable_importance.py", line 51, in <module>
    rf_classifier.fit(ddf[features], ddf['response'])

  File "C:\Users\Daniel\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 295, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)

  File "C:\Users\Daniel\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 531, in check_array
    array = np.asarray(array, order=order, dtype=dtype)

  File "C:\Users\Daniel\anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "C:\Users\Daniel\anaconda3\lib\site-packages\dask\dataframe\core.py", line 366, in __array__
    x = np.array(self._computed)

  File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\generic.py", line 1909, in __array__
    return com.values_from_object(self)

  File "pandas\_libs\lib.pyx", line 81, in pandas._libs.lib.values_from_object

  File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\generic.py", line 5487, in values
    return self._data.as_array(transpose=self._AXIS_REVERSED)

  File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 830, in as_array
    arr = mgr._interleave()

  File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 848, in _interleave
    result = np.empty(self.shape, dtype=dtype)

MemoryError: Unable to allocate 60.3 GiB for an array with shape (267, 30335674) and data type float64

I tried:

Playing around with the classifier's parameters (eg setting bootstrap=True and max_samples at a low number, thinking that it would only draw a small number of observation at each step, or setting a low max_depth) but to no avail.
Playing around with the Clients parameters, but also without favorable results.

I know I could simply do this on a subsample of the data if nothing works, but I also want to understand how to make this kind of methods work on very large samples, so any help with this would be immensely appreciated.

Best Answer

To fit so much data, you have to use subsamples, for instance tensorflow you sub-sample at each step (using only one batch) and algorithmically speaking you only load one batch at a time in memory it is why it works. Most of the time this is done using a generator instead of the dataset right away. Your problem is that you always load the whole dataset in memory.

To use sub-samples without loading the whole dataset with Random forest, I don't think it is doable using scikit-learn without re-coding part of the library. On the other hand, you can use xgboost and manually do the training part. Here is an example in classification, you can adapt the loss to get an example in regression.

import numpy as np

import xgboost as xgb
from sklearn.datasets import make_blobs
import pandas as pd

# Construct dataset in 1D, dumped in a csv for illustration purpose
X, y = make_blobs(centers= [[0,0], [1,2]],n_samples=10020)
df = pd.DataFrame()
df['feature1']=X[:,0]
df['feature2']=X[:,1]
df['label'] = y.ravel()
features = ['feature1','feature2']

df.to_csv('big_dataset.csv')

# Construct a generator from a csv file. Read chunck of 1000 lines
gen_data = pd.read_csv('big_dataset.csv', chunksize=1000)

class make_model():
    def __init__(self,param,num_round=300):
        self.param=param
        self.num_round=num_round
    def fit(self,gen_data):
        iteration = 0
        
        for df in gen_data:
            dtrain = xgb.DMatrix(np.array(df[features]), label=df['label'])
            if iteration ==0:
                model = xgb.Booster(self.param, [dtrain])
            model = xgb.train(self.param,dtrain,num_boost_round=1, xgb_model=model)
            iteration += 1
            
        self.model_=model
    def predict(self,X):
        dtest=xgb.DMatrix(X)
        return self.model_.predict(dtest)>0.5 # use argmax in non-binary classification
parameters = {'max_depth':5, "booster":"gbtree"} # parameters to tune, see xgboost doc. Can be used to make boosted trees or Random Forests.
model = make_model(parameters) 
model.fit(gen_data)
xgb.plot_importance(model.model_)

Related Solutions

Solved – Scipy.stats.anderson_ksamp negative return values for test statistic

print out X and compare it to (code blocks not formatting right, so I'm not using them):

Y = [iris.data[:,0],iris.data[:,1],iris.data[:,2],iris.data[:,3]]

You'll see that X is a 150x1 array of float64, and Y is a 4-element list.

The input given to your test is different. If you run the AD-ksample test with Y instead of X, you get a very high statistic (205.9) against critical values of .49, 1.3, 1.9, 2.5, and 3.2.

That is to say, you can with great confidence reject the null hypothesis (NH = "all four columns of data are from the same distribution").

Solved – Should I choose Random Forest regressor or classifier

Use the Classifier. No, they are not both valid.

First, I really encourage you to read yourself into the topic of Regression vs Classification. Because using ML without knowing anything about it will give you wrong results which you won't realize. And that's quite dangerous... (it's a little bit like asking which way around you should hold your gun or if it doesn't matter)

Whether you use a classifier or a regressor only depends on the kind of problem you are solving. You have a binary classification problem, so use the classifier.

I could run randomforestregressor first and get back a set of estimated probabilities.

NO. You don't get probabilities from regression. It just tries to "extrapolate" the values you give (in this case only 0 and 1). This means values above 1 or below 0 are perfectly valid as a regression output as it does not expect only two discrete values as output (that's called classification!) but continuous values.

If you want to have the "probabilities" (be aware that these don't have to be well calibrated probabilities) for a certain point to belong to a certain class, train a classifier (so it learns to classify the data) and then use .predict_proba(), which then predicts the probability.

Just to mention it here: .predict vs .predict_proba (for a classifier!)
.predict just takes the .predict_proba output and changes everything to 0 below a certain threshold (usually 0.5) respectively to 1 above that threshold.

Remark: sure, internally, they are the very same except from the "last layer" etc.! Still, see them (or better the problem they are solving) as completely different!

Best Answer

Related Solutions

Solved – Scipy.stats.anderson_ksamp negative return values for test statistic

Solved – Should I choose Random Forest regressor or classifier

Related Question