MATLAB: [GPU] Why do GFLOPS of element-wise matrix operations (addition, multiplication) seem to scale poorly as compared to e.g. mtimes

Tags: gflops, gpu, gpubench, optimization, performance, single

Hello all,
BLUF: Why do element-wise operations on the GPU seem to achieve far lower GFLOPS than operations like mtimes? Even though A+B runs in less absolute time than A*B, the relative rates do not scale the way I would expect from their computational complexity, so in GFLOPS terms A*B is "faster" than A+B. Why?
I am trying to understand some MATLAB speed bottlenecks on the GPU using the techniques I found here (Measuring GPU Performance) as well as GPUbench.
What I do not understand is why element-wise array operations appear slow compared to mtimes and other benchmark measures using matrices of size (N x N). For now, I will assume that the number of floating-point operations for element-wise operations scales linearly with the total number of elements (N^2), while for mtimes I will use the standard count (2*N^3 - N^2).
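For concreteness, here is how I am counting operations (a quick sketch; the anonymous functions and the example value of n are only for illustration):
% Assumed FLOP counts as a function of matrix dimension n
elementwiseFlops = @(n) n.^2;          % one FLOP per element for A+B or A.*B
mtimesFlops      = @(n) 2*n.^3 - n.^2; % standard count for an n-by-n matrix product
n = 4096;                              % example dimension
elementwiseFlops(n)                    % ~1.7e7 FLOPs
mtimesFlops(n)                         % ~1.4e11 FLOPs, i.e. roughly 2*n times more work per element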
To benchmark these, I run the following:
sizes = power(2, 12:2:24);
N = sqrt(sizes);
mmTimesHost = inf(size(sizes));
mmTimesRtimes = inf(size(sizes));
mmTimesAddtimes = inf(size(sizes));
mmTimesMtimes = inf(size(sizes));
for ii = 1:numel(sizes)
    ii  % display loop progress
    A = single(rand( N(ii), N(ii) ));
    B = single(rand( N(ii), N(ii) ));
    A = gpuArray(A);
    B = gpuArray(B);
    mmTimesRtimes(ii)   = gputimeit(@() A.*B);
    mmTimesMtimes(ii)   = gputimeit(@() A*B);
    mmTimesAddtimes(ii) = gputimeit(@() A+B);
end
semilogx(sizes, sizes./mmTimesRtimes/1E9, 'b.-', ...
    sizes, (2*N.^3 - N.^2)./mmTimesMtimes/1E9, 'r.-', ...
    sizes, sizes./mmTimesAddtimes/1E9, 'g.-');
grid on
title('Single precision matrix-matrix rtimes')
xlabel('Matrix size (numel)')
ylabel('Calculation Rate (GFLOPS)')
legend('Rtimes', 'Mtimes', '+','Location', 'NorthWest')
In the resulting plot, I get a maximum of ~7000 GFLOPS for mtimes (A*B), which jibes with GPUbench. The element-wise operations are at the very bottom of the scale.
Speed.png
I can see them if I set
ylim([0 20])
and (assuming my scaling is right) they peak at about 20 GFLOPS, which seems insanely slow. I want to know why. I can only think of four (good) explanations:
  1. I have misunderstood how the number of FLOPs scales with the number of elements involved. I don't see how this could be true for element-wise operations like (+). Although I could be missing a scale factor, I do not believe it is the factor of ~300 the above rates would suggest, and I don't see why the number of operations should scale more than linearly.
  2. There is something inherently "faster" for mtimes than for (+).
  3. There is something different in the Matlab implementation of these functions on the GPU, such that for memory or other reasons mtimes has been optimized, but (+) and A.*B have not.
  4. There is some implementation issue on my end that I am missing (I did try converting A and B to columns rather than matrices; no dice).
There are probably other things I am not thinking about too, but at the end of the day I want to know:
  • Why do element-wise operations seem so much slower than they should be?
  • Is there a way to improve the speed/scaling of these operations? (maybe using the new GPU coder, custom CUDA code, changing the properties of my gpuDevice, or batching my data in a certain way).
Cheers,
-DP

Best Answer

The main factor here is that MTIMES (i.e. matrix-matrix multiplication) is compute bound, whereas PLUS and TIMES (element-wise operations) are memory bound. This means that for both PLUS and TIMES, the memory system on the GPU simply cannot feed the CUDA cores quickly enough for them to be limited by the number of floating-point operations they need to perform. You can read more about this here: https://en.wikipedia.org/wiki/Roofline_model#Arithmetic_intensity . For a GPU, the transfer rate from main memory to the computational cores is slow enough that the cores need to perform roughly 50-100 FLOPs per element transferred before they become compute bound (this varies a lot with GPU generation, and it is getting worse: modern GPUs need even more FLOPs per element to be compute bound).
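To put rough numbers on this, here is a back-of-the-envelope sketch of the arithmetic intensity of each operation. It assumes single precision and that each N-by-N array is read or written exactly once (which is optimistic for MTIMES); the value of n is just an example:
% Arithmetic intensity: FLOPs per byte of device-memory traffic
n = 4096;                              % example matrix dimension
bytesMoved  = 3 * 4 * n^2;             % read A, read B, write result; 4 bytes per single
plusFlops   = n^2;                     % one addition per element
mtimesFlops = 2*n^3 - n^2;             % standard matrix-multiply FLOP count
plusFlops   / bytesMoved               % ~0.08 FLOP/byte  -> far below the compute-bound threshold
mtimesFlops / bytesMoved               % ~680 FLOP/byte (with ideal blocking) -> compute bound
With a device memory bandwidth on the order of 200 GB/sec (roughly what a K20c provides), 1/12 FLOP/byte caps PLUS and TIMES at around 15-20 GFLOPS, which is consistent with the ~20 GFLOPS you measured.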
I modified your example a bit to show the limiting factor for PLUS and TIMES, which is the device memory bandwidth (this was run on my somewhat old Tesla K20c, using R2019a)
sizes = power(2, 12:2:24);
N = sqrt(sizes);
mmTimesHost = inf(size(sizes));
mmTimesRtimes = inf(size(sizes));
mmTimesAddtimes = inf(size(sizes));
mmTimesMtimes = inf(size(sizes));
for ii = 1:numel(sizes)
    A = rand(N(ii), 'single', 'gpuArray');
    B = rand(N(ii), 'single', 'gpuArray');
    mmTimesRtimes(ii)   = gputimeit(@() A.*B);
    mmTimesMtimes(ii)   = gputimeit(@() A*B);
    mmTimesAddtimes(ii) = gputimeit(@() A+B);
end
%% Compute-bound - MTIMES
subplot(2, 1, 1)
mtimesGflops = (2*N.^3 - N.^2)./mmTimesMtimes/1E9;
semilogx(sizes, mtimesGflops, 'r.-');
grid on
title('Single precision MTIMES')
xlabel('Matrix size (numel)')
ylabel('Calculation Rate (GFLOPS)')
legend('MTIMES','Location', 'NorthWest')
%% Memory bound - PLUS, TIMES
subplot(2, 1, 2)
arrayGB = 4 * sizes / 1e9;  % GB per array: 4 bytes per single-precision element
% Each operation reads A, reads B, and writes the result: 3 arrays' worth of traffic
semilogx(sizes, 3 .* arrayGB ./ mmTimesRtimes, 'b.-', ...
    sizes, 3 .* arrayGB ./ mmTimesAddtimes, 'g.-');
grid on
title('Single precision TIMES and PLUS')
xlabel('Matrix size (numel)')
ylabel('Memory Bandwidth (GB/sec)')
legend('TIMES', 'PLUS','Location', 'NorthWest')
gpu-timing.png
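As for speeding the element-wise operations up: a single A+B or A.*B already moves close to the minimum possible amount of data, so the main lever is to fuse several element-wise operations into one kernel, so that each element is read and written only once. arrayfun on gpuArrays performs this kind of fusion. Here is a rough sketch (the particular expression and variable names are just an example, not taken from your code; A and B are single-precision gpuArrays as above):
% Chained element-wise expression: typically evaluated as several kernels,
% with each intermediate result written to and re-read from device memory.
tSeparate = gputimeit(@() 2*A.*B + A - B);
% The same expression fused into a single kernel with arrayfun:
% each element of A and B is read once and the result written once.
fusedFun = @(a, b) 2*a.*b + a - b;
tFused = gputimeit(@() arrayfun(fusedFun, A, B));
fprintf('Separate kernels: %.4f s, fused arrayfun: %.4f s\n', tSeparate, tFused);
The more element-wise work the fused function does per element, the higher its arithmetic intensity; for a lone A+B there is nothing to fuse, and the memory-bandwidth ceiling shown above is effectively the limit.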