MATLAB: GPU optimization of looped vector operations

Tags: gpu, parallel computing, Parallel Computing Toolbox

I think I am making a simple mistake. I am comparing a vectorized integration on the GPU and the CPU. This code takes ~365 s on the GPU (Nvidia Quadro P5200) and 96 s on the parallel CPU (6 workers, Xeon E-2176M, 2.7 GHz). The integral is a straightforward operation on vectors 90,000 elements long in this example, repeated 90,000 times in the loop. As a sanity check, multiplying two 10,000×10,000 arrays of random numbers takes 0.65 s on my GPU and 10.8 s on my CPU, yet in the example below the GPU is slower for the larger arrays. It seems as though the loop introduces a lot of overhead in the GPU operations.
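For reference, the array-multiplication comparison was done roughly like this (a minimal sketch of the benchmark; since gpuArray operations are asynchronous, wait(gpuDevice) is needed before toc):

% Rough benchmark of a 10,000x10,000 matrix multiply, CPU vs GPU
A = rand(1e4);
B = rand(1e4);
tic
C = A*B; % CPU multiply
toc
gA = gpuArray(A);
gB = gpuArray(B);
tic
gC = gA*gB; % GPU multiply
wait(gpuDevice) % GPU calls are asynchronous; wait before stopping the timer
toc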
What is the best strategy to optimize this problem for the GPU?
nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
nub = nu_bar;
n_inf = 0;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar at the center of the absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK is the derivative of K --> d(K)/d(nubar)
% Central difference: use the values on either side of each point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
% The integral
tic
try
    canUseGPU = parallel.gpu.GPUDevice.isAvailable;
catch ME
    canUseGPU = false;
end
%canUseGPU = false;
if canUseGPU
    %integral using the GPU
    gnu_bar = gpuArray(nu_bar);
    gK = gpuArray(K);
    gdK = gpuArray(dK);
    gdN_KK = gpuArray(dN_KK);
    for i = 1:len
        gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
        gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
    end
    dN_KK = gather(gdN_KK);
else
    %integral using parfor on the CPU
    parfor i = 1:len
        dN_KK(i) = sum(nu_bar([1:i-1, i+1:end]) .* K([1:i-1, i+1:end]) ./ (nu_bar([1:i-1, i+1:end]).^2 - nu_bar(i).^2));
        dN_KK(i) = 2*dN_KK(i) + K(i)./(2*nu_bar(i)) + dK(i);
    end
end
% Scales data
dN_KK = (1/(pi*p_density))*dN_KK;
% Adds constant for N infinity
N_KK = dN_KK + n_inf;
toc
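For reference, the quantity the loop computes at each point is (my transcription of the code into a formula, with $p$ = p_density, so the grid spacing is $\Delta\nu = 1/p$):

$$ N_{KK}(\nu_i) \;\approx\; n_\infty + \frac{1}{\pi p}\left[\, 2\sum_{j\neq i}\frac{\nu_j K_j}{\nu_j^{2}-\nu_i^{2}} \;+\; \frac{K_i}{2\nu_i} \;+\; \left.\frac{dK}{d\nu}\right|_{\nu_i} \right] $$

i.e. a discretized Kramers-Kronig-type principal-value integral, where the last two terms approximate the contribution of the singular $j = i$ point.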

Best Answer

This modification uses mat2tiles from the File Exchange to help divide the computation into bigger, vectorized chunks.
It runs in about 2 seconds on my graphics card (GeForce GTX 1080 Ti). Aside from increased vectorization, the key is to eliminate all the indexing expressions of the form x([1:i-1, i+1:end]); those are costly.
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
chunksize = 1000;
vv = gnu_bar.^2;
% Split nu^2 into 1-by-chunksize tiles so each tile is processed as one
% large matrix operation instead of many small indexed ones.
vvchunks = mat2tiles(vv, [1, chunksize]);
numer = gnu_bar.*gK;
c = 1;
for k = 1:numel(vvchunks)
    % Q(j,i) = nu_j*K_j ./ (nu_j^2 - nu_i^2) for every nu_i in this tile
    Q = numer(:)./(vv.' - vvchunks{k});
    % Zero the singular j == i terms (this tile's slice of the diagonal)
    Q(c:len+1:end) = 0;
    c = c + size(Q,2);
    vvchunks{k} = sum(Q,1);
end
gdN_KK = [vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
wait(gpuDevice) % make sure all GPU work has finished before stopping the timer
toc %Elapsed time is 2.027665 seconds.
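Note that the line Q(c:len+1:end) = 0 replaces the costly x([1:i-1, i+1:end]) exclusion by zeroing the singular j == i terms instead. If you don't have mat2tiles installed, the tiling step can be reproduced with the built-in mat2cell; a minimal sketch (only the vvchunks line changes, the loop body stays the same):

% Equivalent tiling with built-in mat2cell: split the 1-by-len vector vv
% into 1-by-chunksize pieces (the last piece takes the remainder).
nchunks = ceil(len/chunksize);
widths = repmat(chunksize, 1, nchunks);
widths(end) = len - chunksize*(nchunks-1); % remainder tile
vvchunks = mat2cell(vv, 1, widths);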