I think I am making a simple mistake. I am comparing a vectrorized integration using the GPU and the CPU. This code takes ~365 sec on the GPU (Nvidia quadro P5200 and 96 sec on the parallel CPU (6 workers, Xeon E-2176M, 2.7 GHz). The integral is a straight forward operation with vectors 90,000 long in this example that repeats 90,000 times in the loop. A test of an array multiplication of two 10,000×10,000 arrays of random numbers takes 0.65 s on my GPU and 10.8 s on my CPU. In the example below the GPU is slower for larger arrays. It seems as though the loop introduces a lot of overhead on the GPU operations.
What is the best strategy to optimize this problem for the GPU?
nubar_low = 2450;nubar_high = 3350;p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;K = zeros(size(nu_bar));nub = nu_bar;n_inf = 0;nub = nu_bar;k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absortion
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));len=length(nu_bar);dN_KK = zeros(1,len);% The integral
tic try canUseGPU = parallel.gpu.GPUDevice.isAvailable; catch ME canUseGPU = false; end %canUseGPU = false;
if canUseGPU %integral using GPU
gnu_bar = gpuArray(nu_bar); gK = gpuArray(K); gdK = gpuArray(dK); gdN_KK = gpuArray(dN_KK); for i = 1:len gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2)); gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i); end dN_KK =gather(gdN_KK); else %integral using GPU parfor i = 1:len dN_KK(i) = sum(nu_bar([1:i-1, i+1:end]) .* K([1:i-1, i+1:end]) ./ (nu_bar([1:i-1, i+1:end]).^2 - nu_bar(i).^2)); dN_KK(i) = 2*dN_KK(i) + K(i)./(2*nu_bar(i)) + dK(i); end end % Scales data
dN_KK = (1/(pi*p_density))*dN_KK;% Adds constant for N infinity
N_KK = dN_KK + n_inf;toc
Best Answer