MATLAB: 3D gpuArray vs cells of 2D gpuArrays major speed difference!

Tags: gpuarray, Parallel Computing Toolbox, slicing, speed

Can anybody explain why these two pieces of code have drastically different runtimes?

I have a shared setup routine:

clear all
y = gpuArray.rand(1000, 1000, 'single');
W = cell(1, 5);
WFull = gpuArray.zeros(1000, 1000, 5);
for j = 1:5
   W{j} = gpuArray.rand(1000, 1000, 'single');
   WFull(:,:,j) = W{j};
end

Version 1 (finishes in 1.4 seconds on my machine)

z = gpuArray.zeros(1000, 1000, 5);
tic
for i = 1:1000
   for j = 1:size(W)
      z(:,:,j) = W{j}*y;
   end
end
toc

vs. Version 2 (finishes in 39 seconds on my machine… roughly 27x slower)

z = gpuArray.zeros(1000, 1000, 5);
tic
for i = 1:1000
   for j = 1:size(WFull, 3)
      z(:,:,j) = WFull(:,:,j)*y;
   end
end
toc

Do you think that slicing large 3D gpuArrays is just really slow compared to looking up cell array values?

Best Answer

Do you think that slicing large 3D gpuArrays is just really slow compared to looking up cell array values?

Yes, looking up a cell is faster than pulling a slice out of a 3D array, and that's true for normal (CPU) arrays as well, as long as the number of slices/cells is small. Indexing WFull(:,:,j) has to copy the slice into a new temporary array, while W{j} simply hands back an array that already exists. Of course, your comparison should really include the time needed to allocate memory for each W{j}.
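If you want to check the slice-versus-cell claim on ordinary CPU arrays, something along these lines should do. It is only a sketch: the variable names (A, C, tSlice, tCell) and the array sizes are my own choices for illustration, and the actual numbers will depend on your machine and MATLAB release.

```matlab
% Compare extracting a page from a 3D array with looking up a cell.
% Sizes mirror the question; names are illustrative only.
A = rand(1000, 1000, 5, 'single');   % 3D array holding five pages
C = cell(1, 5);                      % the same pages stored as cells
for j = 1:5
   C{j} = A(:,:,j);
end

% timeit calls each handle several times and reports a typical run time.
tSlice = timeit(@() A(:,:,3));       % slicing allocates and copies a page
tCell  = timeit(@() C{3});           % cell lookup returns the stored array

fprintf('slice: %.3g s, cell: %.3g s\n', tSlice, tCell);
```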

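Another reason is a bug (not a syntax error, since the code runs) in your for-loop over W{j}. size(W) returns [1 5], and the colon operator only uses the first element, so 1:size(W) is just 1:1. The loop therefore does 1 iteration instead of 5; use numel(W) instead: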

   >> for j=1:size(W), j, end 
j =
       1

This biases the comparison: Version 1 performs only one fifth of the matrix multiplications that Version 2 does.
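For reference, here is a sketch of Version 1 with the loop bound fixed. It repeats the setup from the question so it can be run on its own. The wait(gpuDevice) call before toc is my addition, on the assumption that you want the timer to include GPU work that is still queued; nRep and dev are names I made up.

```matlab
% Setup as in the question (requires Parallel Computing Toolbox and a GPU).
y = gpuArray.rand(1000, 1000, 'single');
W = cell(1, 5);
for j = 1:5
   W{j} = gpuArray.rand(1000, 1000, 'single');
end

z    = gpuArray.zeros(1000, 1000, 5);
nRep = 1000;                 % number of outer repetitions, as in the question
dev  = gpuDevice;            % handle to the currently selected GPU

tic
for i = 1:nRep
   for j = 1:numel(W)        % numel(W) is 5, whereas 1:size(W) is 1:1
      z(:,:,j) = W{j}*y;
   end
end
wait(dev)                    % block until all queued GPU work has finished
toc
```

With all five products actually computed, the gap to Version 2 should shrink, although I would not expect it to disappear.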

Related Solutions

MATLAB: CPU vs GPU – Is it reasonable

While I can't comment exactly on the CPU/GPU comparison for your specific setup, the general rule of thumb in GPU computing is that you get the best performance by vectorizing your code so that each call operates on as much data as possible. Since you are already using the bsxfun function, I would go a step further and use the code below to avoid the for loop and operate on all of your data in a single call.

Also, the timeit/gputimeit functions are the best choice for comparing CPU and GPU execution, as each of them runs the function multiple times and reports a typical time. Furthermore, gputimeit takes into account the fact that GPU operations execute asynchronously, while tic/toc does not.

x1GPU=gpuArray.randn(3,1,10000);
x2GPU=gpuArray.randn(3,5000);
gputimeit(@()bsxfun(@minus, x1GPU, x2GPU))
x1CPU=randn(3,1,10000);
x2CPU=randn(3,5000);
timeit(@()bsxfun(@minus, x1CPU, x2CPU))

I can confirm that when I run the code you provided, it takes about 4 seconds on the GPU and 1.5 seconds on the CPU. For the code that I have provided, gputimeit reports an average time of 0.0855 seconds on the GPU while timeit reports (again) an average of 1.5 seconds on the CPU.

The bottom line is that CPU for loops with GPU computations in the loop body generally do not give the best performance. You should always try to replace code like this with a vectorized version if possible.

MATLAB: Matrix multiply slices of 3D matrices

If you have MATLAB R2013b or later, you can use the gpuArray pagefun function like so:

C = pagefun(@mtimes, A, B);
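A small self-contained sketch of that call, checked against an explicit loop. The sizes and the names Cloop and maxErr are just for illustration. B here is a single 2D matrix, and I am relying on pagefun expanding it across the pages of A.

```matlab
% Requires Parallel Computing Toolbox and a supported GPU.
A = gpuArray.rand(4, 3, 5, 'single');   % five 4-by-3 pages
B = gpuArray.rand(3, 2, 'single');      % one 3-by-2 matrix reused for every page

C = pagefun(@mtimes, A, B);             % C(:,:,j) = A(:,:,j)*B for each page j

% Same result computed page by page, for comparison.
Cloop = gpuArray.zeros(4, 2, 5, 'single');
for j = 1:size(A, 3)
   Cloop(:,:,j) = A(:,:,j)*B;
end

% Largest absolute difference between the two results, brought back to the CPU.
maxErr = gather(max(abs(C(:) - Cloop(:))));
```

For the gpuArray question at the top of this page, the same pattern would be pagefun(@mtimes, WFull, y), which replaces the inner loop over j with a single call.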
