Dear all,
I am trying to migrate some of my code from parfor to spmd so that I can use codistributed arrays and save a considerable amount of memory (parfor copies the huge matrix var2 to every worker). I was hoping to distribute var2 across the workers instead and save some memory that way. I am running a toy example on our Linux servers using the following function:
function accum_results=try_memory_smpd()
tic
var1=repmat(linspace(1,100,100),100,1);
var2=2*linspace(0.001,0.02,600000);
accum_results_temp=zeros(100,600000);
upper_bound=100;
lower_bound=1;
spmd
    D = codistributed(var1,codistributor('1d', 2));
    temp=getLocalPart(D);
    globalInd=globalIndices(D, 2);
    local_lower_bound = find(globalInd == lower_bound, 1);
    if ~isempty(local_lower_bound)
        fprintf('The lower bound found in Lab %d, indice %d\n',labindex,local_lower_bound);
    end
    if isempty(local_lower_bound) && min(globalInd)>lower_bound
        local_lower_bound=1;
    end
    local_upper_bound = find(globalInd == upper_bound, 1);
    if ~isempty(local_upper_bound)
        fprintf('The upper bound found in Lab %d, indice %d\n',labindex,local_upper_bound);
    end
    if isempty(local_upper_bound) && max(globalInd)<upper_bound
        local_upper_bound=size(temp,2);
    end
    if ~(isempty(local_upper_bound) || isempty(local_lower_bound))
        for j = local_lower_bound:local_upper_bound
            accum_results_temp = accum_results_temp+bsxfun(@times,var2,temp(:,j));
        end
        fprintf('Lab %d works between indice %d and %d \n',labindex,local_lower_bound,local_upper_bound);
    else
        fprintf('No work for Lab %d!!\n',labindex);
    end
    D=[];
    temp=[];
    var2=[];
end
accum_results=zeros(100,600000);
for cell_ind=1:length(accum_results_temp)
    accum_results=accum_results+accum_results_temp{cell_ind};
end
toc
end
Note that the sizes are fairly large, so you may need to reduce the matrix sizes to run this. When I profile the code, the bottleneck appears to be the final for loop, which adds up all the entries of the Composite object returned by the spmd block (the resulting matrix is 100×600000). I also note that the PARFOR implementation of the same code finishes in about half the time. The additional logic in the spmd code (the if-checks, etc.) has no visible impact on performance. Using functions such as cell2mat would defeat the purpose of the spmd implementation, since they create a copy of the data already stored on the workers. I'd be very grateful if someone could give me an idea for how to run this in parallel without excessive memory usage. Thanks in advance.
Cem
P.S. Here's the PARFOR implementation:
function accum_results=try_memory()
tic
var1=repmat(linspace(1,100,100),100,1);
var2=2*linspace(0.001,0.02,600000);
accum_results=zeros(100,600000);
parfor i=1:100
    accum_results=accum_results+bsxfun(@times,var2,var1(:,i));
end
toc
end
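For reference, one direction that might avoid the client-side summation loop described above is to let the workers add up their partial results themselves, so that the client fetches a single matrix. The sketch below is only an illustration and not a tested fix for the timing problem: the function name try_memory_gplus and its size parameter n are made up for this example, it assumes a parallel pool is already open, and it uses gplus from the Parallel Computing Toolbox (as far as I know, newer releases call this spmdPlus). Every worker still holds a full-size partial sum, so this targets the final loop rather than peak memory on the workers.

```matlab
function accum_results = try_memory_gplus(n)
% Hypothetical sketch: reduce the per-worker partial sums on the workers.
% n is an assumed parameter (number of columns of var2) so that the example
% can be tried at a small size first; the original post uses 600000.
var1 = repmat(linspace(1,100,100), 100, 1);
var2 = 2*linspace(0.001, 0.02, n);

spmd
    % Split the columns of var1 across the workers, as in the original code.
    D = codistributed(var1, codistributor('1d', 2));
    temp = getLocalPart(D);

    % Each worker accumulates the contribution of its own columns.
    partial = zeros(100, n);
    for j = 1:size(temp, 2)
        partial = partial + bsxfun(@times, var2, temp(:, j));
    end

    % Sum the partial results across workers; only lab 1 receives the total,
    % the other labs get an empty array.
    total = gplus(partial, 1);
end

% Single transfer from worker 1 to the client, no loop over the Composite.
accum_results = total{1};
end
```

Called as try_memory_gplus(600000), this should compute the same matrix as the two functions above. Whether it is actually faster depends on how expensive the worker-to-worker communication of the 100×n partial sums is on a given cluster, so it would need to be profiled.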
Best Answer