MATLAB: TreeBagger using obscene amount of memory when run in parallel

Tags: nonparametric supervised learning, random forest, out of memory, treebagger

Hi,
I'm experiencing issues when running TreeBagger on a cluster. I run this code on a large cluster with 64 processors and 128 GB of memory. However, when I try to use TreeBagger on my dataset (~200 MB in size) with 5000 trees, MATLAB errors out after a few hours with out-of-memory errors.
Here are my steps:
1. Send a batch job to the cluster via the Distributed Computing Toolbox and open a matlabpool with 32 workers.
2. options = statset('UseParallel', 'Always');
3. B = TreeBagger(ntrees, tsp, tsp_label, 'Fboot', fboot, 'Options', options); where ntrees = 5000 and fboot = 0.5 (a consolidated sketch of the whole script is below).
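Put together, the script I submit looks roughly like this (a sketch only; the data-loading step is just a placeholder for however tsp and tsp_label get onto the workers):

  % Sketch of the script submitted as a batch job to the cluster
  matlabpool open 32                            % 32 workers on the cluster
  % ... load tsp and tsp_label (~200 MB of training data) here ...
  ntrees  = 5000;
  fboot   = 0.5;
  options = statset('UseParallel', 'Always');   % grow trees in parallel
  B = TreeBagger(ntrees, tsp, tsp_label, ...
                 'Fboot', fboot, 'Options', options);
  matlabpool close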
I don't understand why TreeBagger is using so much memory (>128 GB). When I run this same job locally on my 16 GB computer, memory use does not exceed 16 GB. Am I doing something wrong?
Thanks for your help!

Best Answer

Nicholas,
Each worker in the matlabpool is a separate MATLAB process with its own working memory. In the case of TreeBagger, each worker holds a separate copy of the TreeBagger data, which includes your full dataset and, eventually, all or most of the trees, plus any additional object contents. Thus, for TreeBagger, total memory consumption tends to grow quasi-linearly with the size of the matlabpool.
If you run in serial mode on your own computer, there is only one copy of this memory. (Though if you run in parallel on K cores locally, there will be K copies of the data.)
You might try running with a smaller matlabpool if total memory consumption across the pool is the limiting factor.
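For example, something along these lines (a sketch only; the pool size of 8 is an arbitrary illustration, pick the largest size whose total footprint stays within your node's 128 GB):

  matlabpool close force                        % release the 32-worker pool if one is open
  matlabpool open 8                             % fewer workers => fewer copies of the data
  options = statset('UseParallel', 'Always');
  B = TreeBagger(5000, tsp, tsp_label, 'Fboot', 0.5, 'Options', options);
  matlabpool close

The run will take longer with fewer workers, but peak memory across the pool drops roughly in proportion to the pool size.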
Best,
Steve