MATLAB: GPU optimization of looped vector operations

Tags: gpu, parallel computing, Parallel Computing Toolbox

I think I am making a simple mistake. I am comparing a vectorized integration on the GPU and the CPU. This code takes ~365 s on the GPU (Nvidia Quadro P5200) and 96 s on the parallel CPU (6 workers, Xeon E-2176M, 2.7 GHz). The integral is a straightforward operation on vectors 90,000 elements long in this example, repeated 90,000 times in the loop. As a sanity check, multiplying two 10,000×10,000 arrays of random numbers takes 0.65 s on my GPU and 10.8 s on my CPU, yet in the example below the GPU is slower for the larger arrays. It seems as though the loop introduces a lot of overhead in the GPU operations.
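For reference, the array-multiplication comparison was done roughly like this (a minimal sketch of the benchmark; since gpuArray operations are asynchronous, wait(gpuDevice) is needed before toc):

% Rough benchmark of a 10,000x10,000 matrix multiply, CPU vs GPU
A = rand(1e4);
B = rand(1e4);
tic
C = A*B; % CPU multiply
toc
gA = gpuArray(A);
gB = gpuArray(B);
tic
gC = gA*gB; % GPU multiply
wait(gpuDevice) % GPU calls are asynchronous; wait before stopping the timer
toc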
What is the best strategy to optimize this problem for the GPU?
nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
nub = nu_bar;
n_inf = 0;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar at the center of the absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK is the derivative of K --> d(K)/d(nubar)
% Central difference: use the values on either side of each point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
% The integral
tic
try
    canUseGPU = parallel.gpu.GPUDevice.isAvailable;
catch ME
    canUseGPU = false;
end
%canUseGPU = false;
if canUseGPU
    %integral using the GPU
    gnu_bar = gpuArray(nu_bar);
    gK = gpuArray(K);
    gdK = gpuArray(dK);
    gdN_KK = gpuArray(dN_KK);
    for i = 1:len
        gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
        gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
    end
    dN_KK = gather(gdN_KK);
else
    %integral using parfor on the CPU
    parfor i = 1:len
        dN_KK(i) = sum(nu_bar([1:i-1, i+1:end]) .* K([1:i-1, i+1:end]) ./ (nu_bar([1:i-1, i+1:end]).^2 - nu_bar(i).^2));
        dN_KK(i) = 2*dN_KK(i) + K(i)./(2*nu_bar(i)) + dK(i);
    end
end
% Scales data
dN_KK = (1/(pi*p_density))*dN_KK;
% Adds constant for N infinity
N_KK = dN_KK + n_inf;
toc
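For reference, the quantity the loop computes at each point is (my transcription of the code into a formula, with $p$ = p_density, so the grid spacing is $\Delta\nu = 1/p$):

$$ N_{KK}(\nu_i) \;\approx\; n_\infty + \frac{1}{\pi p}\left[\, 2\sum_{j\neq i}\frac{\nu_j K_j}{\nu_j^{2}-\nu_i^{2}} \;+\; \frac{K_i}{2\nu_i} \;+\; \left.\frac{dK}{d\nu}\right|_{\nu_i} \right] $$

i.e. a discretized Kramers-Kronig-type principal-value integral, where the last two terms approximate the contribution of the singular $j = i$ point.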

Best Answer

This modification uses mat2tiles from the File Exchange to help divide the computation into bigger, vectorized chunks.
It runs in about 2 seconds on my graphics card (GeForce GTX 1080 Ti). Aside from increased vectorization, the key is to eliminate all the indexing expressions of the form x([1:i-1, i+1:end]); those are costly.
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
chunksize = 1000;
vv = gnu_bar.^2;
% Split nu^2 into 1-by-chunksize tiles so each tile is processed as one
% large matrix operation instead of many small indexed ones.
vvchunks = mat2tiles(vv, [1, chunksize]);
numer = gnu_bar.*gK;
c = 1;
for k = 1:numel(vvchunks)
    % Q(j,i) = nu_j*K_j ./ (nu_j^2 - nu_i^2) for every nu_i in this tile
    Q = numer(:)./(vv.' - vvchunks{k});
    % Zero the singular j == i terms (this tile's slice of the diagonal)
    Q(c:len+1:end) = 0;
    c = c + size(Q,2);
    vvchunks{k} = sum(Q,1);
end
gdN_KK = [vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
wait(gpuDevice) % make sure all GPU work has finished before stopping the timer
toc %Elapsed time is 2.027665 seconds.
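Note that the line Q(c:len+1:end) = 0 replaces the costly x([1:i-1, i+1:end]) exclusion by zeroing the singular j == i terms instead. If you don't have mat2tiles installed, the tiling step can be reproduced with the built-in mat2cell; a minimal sketch (only the vvchunks line changes, the loop body stays the same):

% Equivalent tiling with built-in mat2cell: split the 1-by-len vector vv
% into 1-by-chunksize pieces (the last piece takes the remainder).
nchunks = ceil(len/chunksize);
widths = repmat(chunksize, 1, nchunks);
widths(end) = len - chunksize*(nchunks-1); % remainder tile
vvchunks = mat2cell(vv, 1, widths);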