MATLAB: SPMD vs PARFOR and Memory Usage

memory, parfor, performance, spmd

Dear all,

I am trying to migrate some of my parfor code to spmd so that I can use codistributed arrays and save a considerable amount of memory (parfor copies the huge matrix var2 to every worker). I was hoping to distribute var2 across the workers instead. I am running a toy example on our Linux servers using the following function:

function accum_results=try_memory_smpd()
tic
var1=repmat(linspace(1,100,100),100,1);
var2=2*linspace(0.001,0.02,600000);
accum_results_temp=zeros(100,600000);
upper_bound=100;
lower_bound=1;
spmd
    D = codistributed(var1,codistributor('1d', 2));
    temp=getLocalPart(D);
    globalInd=globalIndices(D, 2);
    local_lower_bound = find(globalInd == lower_bound, 1);
    if ~isempty(local_lower_bound)
        fprintf('The lower bound found in Lab %d, indice %d\n',labindex,local_lower_bound);
    end
    if isempty(local_lower_bound) && min(globalInd)>lower_bound
        local_lower_bound=1;
    end
    local_upper_bound = find(globalInd == upper_bound, 1);
    if ~isempty(local_upper_bound)
        fprintf('The upper bound found in Lab %d, indice %d\n',labindex,local_upper_bound);
    end
    if isempty(local_upper_bound) && max(globalInd)<upper_bound
        local_upper_bound=size(temp,2);
    end
    if ~(isempty(local_upper_bound) || isempty(local_lower_bound))
        for j = local_lower_bound:local_upper_bound
            accum_results_temp = accum_results_temp+bsxfun(@times,var2,temp(:,j));
        end
        fprintf('Lab %d works between indice %d and %d \n',labindex,local_lower_bound,local_upper_bound);
    else
        fprintf('No work for Lab %d!!\n',labindex);
    end
    D=[];
    temp=[];
    var2=[];
end
accum_results=zeros(100,600000);
for cell_ind=1:length(accum_results_temp)
    accum_results=accum_results+accum_results_temp{cell_ind};
end
toc
end

Note that the sizes are fairly large, so you may need to reduce the matrix sizes. When I profile the code, the bottleneck appears to be the final for loop, which adds up all the entries of the Composite object returned by the spmd block (so the resulting matrix is 100×600000). The PARFOR implementation of the same code finishes in about half the time. The extra logic in the spmd code (the if-checks and so on) has no visible impact on performance. Using something like cell2mat would defeat the purpose of the spmd implementation, since it would create a copy of the data already stored on the workers. I'd be very grateful for any idea or inspiration that lets me use parallel code without excessive memory usage. Thanks in advance.

Cem

P.S. Here's the PARFOR implementation:

function accum_results=try_memory()
tic
var1=repmat(linspace(1,100,100),100,1);
var2=2*linspace(0.001,0.02,600000);
accum_results=zeros(100,600000);
parfor i=1:100
    accum_results=accum_results+bsxfun(@times,var2,var1(:,i));
end
toc
end
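
For reference, the loop above has a simple closed form: column i of var1 is constant and equal to i, so every row of the result is var2 scaled by 1 + 2 + ... + N. The sketch below checks this on a small problem with a plain for loop. The sizes N = 4 and M = 10 are illustrative assumptions, not the sizes from the post. The same comparison could be used to validate a parallel version on a reduced problem.

```matlab
% Small sizes chosen for illustration only (assumption);
% the post uses N = 100 and M = 600000.
N = 4;
M = 10;
var1 = repmat(linspace(1, N, N), N, 1);   % same construction as in the post
var2 = 2 * linspace(0.001, 0.02, M);

% Serial reference: the PARFOR loop body, run as an ordinary for loop.
ref = zeros(N, M);
for i = 1:N
    ref = ref + bsxfun(@times, var2, var1(:, i));
end

% Closed form: every row equals var2 scaled by 1 + 2 + ... + N.
closed_form = repmat((N * (N + 1) / 2) * var2, N, 1);

max_abs_diff = max(abs(ref(:) - closed_form(:)));
fprintf('max abs difference: %g\n', max_abs_diff);
```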

Best Answer

Here's a version that I've reworked quite a bit to use codistributed arrays hopefully a little more effectively (it runs about twice as quickly on my machine here).

tic
N = 100;
M = 600000;
spmd
    % Build 'var1' on the workers directly to avoid communication
    var1=linspace(1,N,N);
    % Build 'var2' directly in codistributed form to save memory
    var2=2 * codistributed.linspace(0.001,0.02,M);
    %
    % We will operate on the local part of var2
    var2_lp = getLocalPart(var2);
    var2_cod = getCodistributor(var2);
    %
    % Build accum_results as a codistributed array directly, and ensure
    % it uses the same codistributor as var2 so that we can operate
    % on the local parts of the arrays together.
    accum_results = codistributed.zeros(N, M, var2_cod);
    %
    % Get the local part out of accum_results so that we can operate on it directly.
    ar_local = getLocalPart(accum_results);
    ar_codist = getCodistributor(accum_results);
    %
    % Loop 1:N applying the BSXFUN to the relevant local parts
    for idx = 1:N
        ar_local = ar_local + bsxfun(@times, var2_lp, repmat(var1(idx), N, 1));
    end
    % Put accum_results back together
    accum_results = codistributed.build(ar_local, ar_codist);
end
% Gather the results back to the host
accum_results=gather(accum_results);
toc
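
If you would rather keep one full-size accumulator per worker, as in the question's SPMD version, the client-side loop over the Composite can be replaced by a reduction on the workers with gplus, so that only the final sum travels back to the client. This is a minimal sketch rather than a tuned implementation. It assumes a pool is already open, uses small illustrative sizes, and splits the columns round-robin by labindex, which is my simplification and not the partitioning used in the question. Note that each worker still holds a full N-by-M accumulator, so this addresses the summation bottleneck but not the memory footprint.

```matlab
% Assumes a parallel pool is already open (e.g. via matlabpool).
% Sizes are illustrative assumptions; the post uses N = 100, M = 600000.
N = 4;
M = 10;
var2 = 2 * linspace(0.001, 0.02, M);   % small here, so sending it to the workers is cheap

spmd
    % Each worker accumulates its own share of the indices 1:N.
    local_acc = zeros(N, M);
    for j = labindex:numlabs:N          % round-robin split (assumption)
        local_acc = local_acc + bsxfun(@times, var2, repmat(j, N, 1));
    end
    % Sum the per-worker accumulators on the workers.
    % With a target lab of 1, only lab 1 receives the total.
    total = gplus(local_acc, 1);
end

% Fetch the single summed matrix from lab 1.
accum_results = total{1};
```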

Related Solutions

MATLAB: Does “matlabpool open local 4” fail when using the MPD build of the MPICH2 library

These errors are caused by an incompatibility between the MPD build of MPICH2 and the Local Scheduler.

To work around this issue, additional logic must be inserted into the 'mpiLibConf.m' file to only select the MPD build when not using the Local Scheduler. For example, if you have already modified the mpiLibConf.m file to use the MPD build of the library, consider adding the following:

if strcmp( getenv( 'MDCE_DECODE_FUNCTION' ), 'decodeLocalParallelTask' )
    % Local scheduler - don't use the MPD build
    warning( 'Local scheduler: about to use default installed MPICH2 build' );
    if ismac
        extras = {'libmpich.dylib'};
        primaryLib = 'libpmpich.dylib';
    else
        primaryLib = 'libmpich.so';
    end
end

This should select the default MPI libraries if the Local Scheduler is used.

MATLAB: Distributing arrays to workers for local processing

Hello. If you are able to successfully open a matlabpool with your installation of R2011b, then you must have the Parallel Computing Toolbox. In that case, the getLocalPart function should also be available to you. What is the output from typing the following at the MATLAB command line:

which getLocalPart

Assuming that you can get the issue with getLocalPart sorted out (perhaps by calling technical support), this is how you would proceed with distributed arrays/spmd:

matlabpool open 100 % this will open 100 workers 
                    % using your default configuration
% I assume that myMat was already loaded as a standard MATLAB array
size(myMat)     % You've stated that myMat is 800000 x 2     
% There are a lot of rows, so let's use codistributor1d to 
% distribute the rows across all the workers in the pool.  This must
% be done inside the spmd block because that's where 
% codistributed arrays and codistributors live.
spmd 
  codist = codistributor1d(1); % Create a scheme to distribute the first
                               % dimension of a matrix (its rows) as evenly as
                               % possible across all the workers in the 
                               % pool    
  myMatdb = codistributed(myMat, codist);  % Use the scheme to create 
                                           % distributed data
  chunk_of_data = getLocalPart(myMatdb);   % Each worker operates on its data 
  [out_of_chunk] = objFun(params, chunk_of_data);
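  % Create a new array from the local outputs. I assume that out_of_chunk
  % is the same size as chunk_of_data on each worker, so that the
  % codistributor of myMatdb can be reused. codistributed.build needs a
  % complete codistributor (one that knows the global size), so take it
  % from myMatdb rather than reusing the size-less template codist.
  fullOutput = codistributed.build(out_of_chunk, getCodistributor(myMatdb));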
end  
% fullOutput and myMatdb can be used as distributed arrays outside of the spmd block
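
To make the round trip concrete, here is a small self-contained sketch of the same pattern. It assumes a pool is already open. The 10 x 2 matrix and the doubling step are hypothetical stand-ins for myMat and objFun, chosen only so the result is easy to check. It takes the codistributor from the distributed input so that codistributed.build receives one that already knows the global size.

```matlab
% Assumes a parallel pool is already open.
myMat = reshape(1:20, 10, 2);            % small stand-in for the 800000 x 2 matrix

spmd
    % Distribute the rows (dimension 1) across the workers.
    myMatdb = codistributed(myMat, codistributor1d(1));
    chunk_of_data = getLocalPart(myMatdb);

    % Hypothetical stand-in for objFun: same-size output per worker.
    out_of_chunk = 2 * chunk_of_data;

    % Reassemble using the complete codistributor of the input array.
    fullOutput = codistributed.build(out_of_chunk, getCodistributor(myMatdb));
end

% Outside spmd, fullOutput is a distributed array; gather brings it
% back to the client as an ordinary MATLAB array.
result = gather(fullOutput);
```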

You can find more information here:

help getLocalPart
help codistributed.build
