Hello all,
BLUF: Why do element-wise operations on the GPU achieve a much lower calculation rate than operations like mtimes? Even though A+B runs in less absolute time than A*B, once the timings are normalized by the operation counts, the relative speed does not scale as I would expect from computational complexity. Thus, A*B is "faster" than A+B. Why?
I am trying to understand some MATLAB speed bottlenecks on the GPU using the techniques I found here (Measuring GPU Performance) as well as GPUbench.
What I do not understand is why element-wise array operations appear slow compared to mtimes and other benchmark measures on matrices of size (N x N). For now, I am going to assume that the number of floating-point operations for element-wise operations scales linearly with the total number of elements (N^2), while for mtimes I will use the standard count (2*N^3 - N^2).
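For concreteness, here is how those assumed operation counts compare for one (illustrative, not benchmarked) matrix size; the 2*N^3 - N^2 figure is the textbook count for a dense N-by-N multiply, i.e. N^2 inner products of N multiplies and N-1 adds each:

```matlab
% Assumed FLOP counts for an N-by-N problem (N chosen for illustration only)
N = 4096;
flopsElementwise = N^2;                   % one floating-point op per output element
flopsMtimes      = 2*N^3 - N^2;           % N^2 dot products: N mults + (N-1) adds
ratio = flopsMtimes / flopsElementwise;   % = 2*N - 1, i.e. 8191 for N = 4096
```

So under these assumptions mtimes performs roughly 2N times more arithmetic than (+) on the same pair of matrices.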
To benchmark these, I run the following:
sizes = power(2, 12:2:24);
N = sqrt(sizes);
mmTimesHost = inf(size(sizes));
mmTimesRtimes = inf(size(sizes));
mmTimesAddtimes = inf(size(sizes));
mmTimesMtimes = inf(size(sizes));
for ii = 1:numel(sizes)
    ii
    A = single(rand( N(ii), N(ii) ));
    B = single(rand( N(ii), N(ii) ));
    A = gpuArray(A);
    B = gpuArray(B);
    mmTimesRtimes(ii) = gputimeit(@() A.*B);
    mmTimesMtimes(ii) = gputimeit(@() A*B);
    mmTimesAddtimes(ii) = gputimeit(@() A+B);
end
semilogx(sizes, sizes./mmTimesRtimes/1E9, 'b.-', ...
    sizes, (2*N.^3 - N.^2)./mmTimesMtimes/1E9, 'r.-', ...
    sizes, sizes./mmTimesAddtimes/1E9, 'g.-');
grid on
title('Single precision matrix-matrix rtimes')
xlabel('Matrix size (numel)')
ylabel('Calculation Rate (GFLOPS)')
legend('Rtimes', 'Mtimes', '+', 'Location', 'NorthWest')
In the resulting plot, I get a maximum of ~7000 GFLOPS for mtimes (A*B), which jibes with GPUbench. The element-wise operations are at the very bottom of the scale.
I can see them if I set
ylim([0 20])
and see that (assuming my scaling is right) I peg out at about 20 GFLOPS, which seems insanely slow. I want to know why; I can only think of four (good) options:
- I have misunderstood how the number of FLOPs scales with the number of elements involved. I don't see how this could be true for element-wise operations like (+). Although I could be missing a scale factor, I do not believe it is 300 as the above speeds suggest, and I don't see why the number of operations should scale more than linearly.
- There is something inherently "faster" for mtimes than for (+).
- There is something different in the MATLAB implementation of these functions on the GPU, such that for memory or other reasons mtimes has been optimized, but (+) and A.*B have not.
- There is some implementation issue on my end that I am missing (I did try converting A and B to columns rather than matrices; no dice).
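One way to probe the memory-related possibility above is to convert the element-wise timing into achieved memory bandwidth rather than GFLOPS. This is a sketch under the assumption that A+B in single precision reads two 4-byte values and writes one per output element; the variable names and the choice of N are mine:

```matlab
% Sketch: express an element-wise timing as memory bandwidth instead of FLOPS.
% For C = A + B in single precision, each output element requires reading
% 8 bytes (two singles) and writing 4, so roughly 12*N^2 bytes move in total.
N = 4096;
A = gpuArray(single(rand(N)));
B = gpuArray(single(rand(N)));
t = gputimeit(@() A + B);
bytesMoved   = 12 * N^2;              % 2 reads + 1 write, 4 bytes per single
bandwidthGBs = bytesMoved / t / 1e9;  % compare against the card's peak bandwidth
```

If that number comes out near the device's peak memory bandwidth, the kernel is moving data as fast as the hardware allows, and there is simply very little arithmetic per byte to report as FLOPS.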
There are probably other things I am not thinking about too, but at the end of the day I want to know:
- Why do element-wise operations run so much slower than it seems they should?
- Is there a way to improve the speed/scaling of these operations? (Maybe using the new GPU Coder, custom CUDA code, changing the properties of my gpuDevice, or batching my data in a certain way.)
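As one concrete form of the batching idea in the last question, several element-wise steps can be fused into a single GPU kernel by calling arrayfun on gpuArray inputs; the composite operation below is a made-up example for comparison purposes:

```matlab
% Sketch: arrayfun on gpuArray inputs compiles the whole function body into
% one kernel, so a chain of element-wise operations reads and writes memory
% once overall instead of once per operator.
N = 4096;
A = gpuArray(single(rand(N)));
B = gpuArray(single(rand(N)));
fused = @(a, b) (a + b) .* a - b;     % hypothetical composite element-wise op
tFused   = gputimeit(@() arrayfun(fused, A, B));
tUnfused = gputimeit(@() (A + B) .* A - B);
```

This would not change the rate of a single lone (+), but it can reduce the total memory traffic when several element-wise operations are chained.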
Cheers,
-DP