MATLAB: For loop runs 100x slower on GPU than on CPU

gpu parallelism

I'm in the process of moving some code from the CPU to the GPU, and I ran into a weird issue where a loop runs extremely slowly on the GPU. Here is a simplified snippet that reproduces the problem:
tic;
d=gpuArray(0);
n=gpuArray(100);
P=gpuArray(rand(3, 100));
x = toc;
fprintf("Allocation time is %f\n", x);
tic;
for j = 1:n
    for k = 1:n
        d = d + (n^-2)*norm(P(:,j) - P(:,k));
    end
end
x=toc;
fprintf("Loop time is %f\n", x);
Allocation time is 0.007388
Loop time is 5.119713
…I'm a little confused. This loop is taking 5 seconds on the GPU, but if I run it on the CPU it takes 0.03 seconds.
Any thoughts? All of my data is gpuArray(), and norm() is a gpu-compatible built-in.
Thanks.

Best Answer

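I realized that I need to write more parallelizable code. Each pass through the double loop issues a handful of tiny GPU operations on 3-element vectors, so the 10,000 iterations are dominated by per-call overhead rather than by arithmetic. Computing all pairwise distances in one vectorized call avoids that overhead: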
tic;
d=gpuArray(0);
n=gpuArray(100);
P=gpuArray(rand(3, 100));
x = toc;
fprintf("Allocation time is %f\n", x);
tic;
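% pdist2 treats rows as observations, so pass the points as rows (P').
D = pdist2(P', P');
% Sum every entry and divide by n^2 to match the double loop in the question.
d = sum(D(:)) / n^2;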
x=toc;
fprintf("Loop time is %f\n", x);
Allocation time is 0.001666
Loop time is 0.012691
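
As a sanity check, here is a minimal sketch that compares the double loop with a vectorized form on the same data. It runs on the CPU as written and does not need pdist2. It assumes implicit expansion is available (R2016b or later), and the names dLoop, dVec, diffs and D are made up for this illustration. Wrapping P in gpuArray(...) should give the same value on a machine with a supported GPU, but that is an assumption I have not timed here.

```matlab
% Sketch: check that a vectorized sum matches the original double loop.
n = 100;                      % plain double, not a gpuArray
P = rand(3, n);               % each column is one 3-D point

% Reference: the double loop from the question (mean distance over all j,k).
dLoop = 0;
for j = 1:n
    for k = 1:n
        dLoop = dLoop + norm(P(:,j) - P(:,k)) / n^2;
    end
end

% Vectorized: build all pairwise differences at once.
% reshape(P, 3, 1, n) is 3-by-1-by-n, so the subtraction expands to
% 3-by-n-by-n with diffs(:,j,k) = P(:,j) - P(:,k).
diffs = P - reshape(P, 3, 1, n);
D = sqrt(sum(diffs.^2, 1));   % 1-by-n-by-n array of Euclidean distances
dVec = sum(D(:)) / n^2;

fprintf("loop: %.12f, vectorized: %.12f\n", dLoop, dVec);
```

The two printed values should agree up to floating-point rounding. When timing the GPU version, note that GPU calls can return before the work has finished, so a bare tic/toc may under-report. Calling wait(gpuDevice) before toc, or using gputimeit, gives a more trustworthy number.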