I am currently trying to fit a binary random forest classifier on a large dataset (30+ million rows, 200+ features, in the 25 GB range) in order to perform a variable importance analysis, but I am failing due to memory problems. I was hoping someone here could help with possible techniques, alternative solutions, and best practices for doing this.
I would very much appreciate:
- How to make my approach described below actually work.
- If not possible, alternative libraries/methods to do the same thing (possibly working on a dask dataframe). Here I guess maybe tensorflow is a possibility (I haven't tried yet).
- If still not possible, alternative approaches to variable importance that can be scaled to very large datasets.
Details
I am reading my dataset using dask.dataframe from a parquet file (since the data don't fit in memory anyway). As a model I use sklearn.ensemble.RandomForestClassifier. Additionally, I am playing around with dask.distributed with joblib.parallel_backend('dask').
My hope was that this would exploit dask in order to avoid going over memory, but that doesn't seem to be the case. Here is my code (dataset-specific details omitted):
import dask.dataframe as dd
from sklearn.ensemble import RandomForestClassifier
from dask.distributed import Client
import joblib

# load dask dataframe with the training sample
ddf = dd.read_parquet('my_parquet_file',
                      index=False)
features = [...]

# random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=16,
                                       criterion='entropy',
                                       n_jobs=-1,
                                       random_state=543,
                                       verbose=True)

# fit inside a local dask client, using dask as the joblib backend
with Client(processes=False) as client:
    with joblib.parallel_backend('dask'):
        rf_classifier.fit(ddf[features], ddf['response'])
What I get are a ton of warnings of this form:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 11.95 GB -- Worker memory limit: 17.03 GB
And then at the end an error:
File "C:\Users\Daniel\Documents\GitHub\PIT-TTC-PD\Hyperparameter estimation\random_forest_variable_importance.py", line 51, in <module>
rf_classifier.fit(ddf[features], ddf['response'])
File "C:\Users\Daniel\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 295, in fit
X = check_array(X, accept_sparse="csc", dtype=DTYPE)
File "C:\Users\Daniel\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 531, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\Daniel\anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
File "C:\Users\Daniel\anaconda3\lib\site-packages\dask\dataframe\core.py", line 366, in __array__
x = np.array(self._computed)
File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\generic.py", line 1909, in __array__
return com.values_from_object(self)
File "pandas\_libs\lib.pyx", line 81, in pandas._libs.lib.values_from_object
File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\generic.py", line 5487, in values
return self._data.as_array(transpose=self._AXIS_REVERSED)
File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 830, in as_array
arr = mgr._interleave()
File "C:\Users\Daniel\anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 848, in _interleave
result = np.empty(self.shape, dtype=dtype)
MemoryError: Unable to allocate 60.3 GiB for an array with shape (267, 30335674) and data type float64
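As a sanity check, the size in the error message can be reproduced from the shape and dtype alone. This is only a back-of-the-envelope sketch, assuming 8 bytes per float64 element and 1 GiB = 2**30 bytes:

```python
# Shape reported in the MemoryError above: (267, 30335674), dtype float64.
n_columns, n_rows = 267, 30_335_674
bytes_per_float64 = 8  # assumption: standard 64-bit float

# Size of one dense array holding every value at once.
total_bytes = n_columns * n_rows * bytes_per_float64
total_gib = total_bytes / 2**30

print(f"{total_gib:.1f} GiB")  # prints "60.3 GiB"
```

So the allocation that fails is a single dense copy of the whole table, requested when the dask dataframe is converted to a NumPy array inside check_array, before any tree is built.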
I tried:
- Playing around with the classifier's parameters (e.g. setting bootstrap=True and max_samples at a low number, thinking that it would only draw a small number of observations at each step, or setting a low max_depth), but to no avail.
- Playing around with the Client's parameters, but also without favorable results.
I know I could simply do this on a subsample of the data if nothing works, but I also want to understand how to make this kind of method work on very large samples, so any help with this would be immensely appreciated.
Best Answer
To fit this much data, you have to use subsamples. In tensorflow, for instance, you sub-sample at each step (using only one batch), and algorithmically speaking you only load one batch into memory at a time, which is why it works. Most of the time this is done by feeding the model from a generator instead of passing the whole dataset at once. Your problem is that you always load the whole dataset into memory.
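To make the generator idea concrete, here is a minimal sketch. The function batch_generator and the synthetic data are hypothetical stand-ins: in practice each batch would be read from disk (for example one parquet row group or one dask partition at a time) rather than generated randomly.

```python
import numpy as np

def batch_generator(n_rows, n_features, batch_size, seed=0):
    """Yield (X, y) one batch at a time, so only one batch lives in memory.

    Hypothetical stand-in: the random data replaces a real read from disk.
    """
    rng = np.random.default_rng(seed)
    for start in range(0, n_rows, batch_size):
        rows = min(batch_size, n_rows - start)
        X = rng.normal(size=(rows, n_features))
        y = (X[:, 0] > 0).astype(np.int8)
        yield X, y

n_rows, n_features, batch_size = 10_000, 20, 1_000

total_rows = 0
largest_batch_bytes = 0
for X, y in batch_generator(n_rows, n_features, batch_size):
    # A real training loop would update the model here.
    total_rows += len(X)
    largest_batch_bytes = max(largest_batch_bytes, X.nbytes)

# Memory needed if all rows were materialised at once as float64.
full_bytes = n_rows * n_features * 8
print(total_rows, largest_batch_bytes, full_bytes)
```

Every row is still visited, but the largest array ever held is one batch, which here is a tenth of the full table.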
With a random forest, I don't think it is doable to train on sub-samples without loading the whole dataset using scikit-learn, short of re-coding part of the library. On the other hand, you can use xgboost and do the training part manually. Here is an example for classification; you can adapt the loss to get an example for regression.