MATLAB: Uploading small vectors to GPU is faster than directly creating them

Tags: cuda, gpu, gpuarray, malloc, Parallel Computing Toolbox

Hello,
while experimenting with MATLAB's GPU support I noticed that it is faster to create an array on the CPU and upload it afterwards with gpuArray than to create it directly on the GPU with the functions provided by parallel.gpu.GPUArray.
To show this behaviour for different array sizes I've written a small test script. I've uploaded an image of the most interesting output here. It shows that for arrays smaller than 300,000 elements it is faster to create the array on the CPU and then upload it with gpuArray. I know it is probably bad practice on my side to create many small arrays on the GPU, but the missing gpuArray indexing support in the trial version I tested did not allow me to create one big matrix instead. Unfortunately the trial is now over. I wanted to ask whether this behaviour is normal or some kind of bug. To me it looks like the GPUArray functions hit some kind of bottleneck, because CPU usage was at 100% on one core while the test was running.
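Concretely, these are the two creation paths being compared (a minimal sketch; the size 300000 is just an example):

% Path 1: allocate on the CPU, then copy to the GPU
h  = zeros(300000, 1, 'single');     % host allocation
g1 = gpuArray(h);                    % transfer to the device

% Path 2: allocate directly on the GPU
g2 = parallel.gpu.GPUArray.zeros(300000, 1, 'single');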
-------------------------------------------------------------------------------------
MATLAB Version 7.11.1.866 (R2010b) Service Pack 1
MATLAB License Number: DEMO
Operating System: Microsoft Windows 7 Version 6.1 (Build 7600)
Java VM Version: Java 1.6.0_17-b04 with Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM mixed mode
-------------------------------------------------------------------------------------
MATLAB Version 7.11.1 (R2010bSP1)
Parallel Computing Toolbox Version 5.0 (R2010bSP1)
Signal Processing Toolbox Version 6.14 (R2010bSP1)
  • Windows 7
  • Nvidia Driver: 275.33
  • Cuda Toolkit v4.0
  • GPU: GeForce GTX 460 with 1 GB Memory (not sure about this at the moment because I can't look it up)
The test script:
MBtoLeave = 10;                              % headroom to leave free on the device (MB)
dev = gpuDevice();
free = dev.FreeMemory - MBtoLeave * 1024^2;  % usable free memory in bytes
maxPow = floor(log2(free/4));                % largest power-of-two element count that fits (4 bytes per single)

% Allocation methods to compare: host allocation + upload vs. direct GPU builds
mallocFuncs = {@(y, x, t) gpuArray(zeros(y, x, t)), ...
               @parallel.gpu.GPUArray.zeros, @parallel.gpu.GPUArray.ones, ...
               @parallel.gpu.GPUArray.nan, @parallel.gpu.GPUArray.inf};

results = {};
for mallocFunc = mallocFuncs
    mallocFunc = mallocFunc{:};              % unwrap the 1x1 cell produced by looping over a cell array
    fprintf('Malloc method: %s\n', func2str(mallocFunc));
    iterations = 2000;
    times = [];
    for pow = 0:maxPow-1
        mallocSize = 2^pow;
        tic;
        for k = 1:iterations
            g = mallocFunc(mallocSize, 1, 'single');
        end
        time = toc;
        mallocsPerSec = iterations/time;
        bytesPerSec = mallocsPerSec * mallocSize * 4;
        fprintf('Malloc size: %d, time: %f, mallocs per second: %f, MB/s: %f\n', ...
            mallocSize, time, mallocsPerSec, bytesPerSec/1024^2);
        if time > 1.5                        % halve the iteration count for larger sizes to keep runtime bounded
            iterations = iterations / 2;
        end
        times(end+1, 1:4) = [mallocSize, time, mallocsPerSec, bytesPerSec];
    end
    results(end+1, 1:2) = {mallocFunc, times};
end

% Collect the per-method columns and plot the results
mallocsPerSec = cell2mat(cellfun(@(x) x(:, 3), results(:, 2), 'UniformOutput', false)');
mbPerSec = cell2mat(cellfun(@(x) x(:, 4), results(:, 2), 'UniformOutput', false)') / 1024^2;
x = results{1, 2}(:, 1);

semilogx(x, mallocsPerSec);
title('Allocations per second');
legend(cellfun(@func2str, results(:, 1), 'UniformOutput', false))

figure
semilogx(x, mbPerSec);
legend(cellfun(@func2str, results(:, 1), 'UniformOutput', false))
title('Megabytes per second');

figure
loglog(x, 1./mbPerSec);
legend(cellfun(@func2str, results(:, 1), 'UniformOutput', false))
title('Time to transfer 1 MB');
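One caveat about the measurement (an assumption on my part, not something I could verify on R2010b): tic/toc times how long it takes to issue the GPU calls, and the device may still be working when toc runs. In releases where the GPU device object supports wait, the inner timing loop could be synchronised before stopping the timer, roughly like this:

tic;
for k = 1:iterations
    g = mallocFunc(mallocSize, 1, 'single');
end
wait(dev);      % block until all queued GPU work has finished (assumes wait(dev) exists in your release)
time = toc;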

Best Answer

The parallel.gpu.GPUArray.* build functions in R2010b were provided mostly to allow you to avoid host allocation prior to copying to the GPU, and we've made performance improvements to them (and many other things too) since that release; however, in R2011a the build family of functions still has MATLAB-code wrappers, which do introduce some overhead. We're working on removing all of those to improve performance. In case you weren't aware, R2011a also has GPUArray indexing.
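With indexing support, the many small allocations in the question could in principle be replaced by one large device array (a sketch only; the sizes are arbitrary and this assumes R2011a-style GPUArray indexing):

% One large device-side allocation instead of many small ones
big = parallel.gpu.GPUArray.zeros(1000, 500, 'single');
v = big(:, 17);     % pull out one column as a small GPUArray via indexing
v = v + 1;          % element-wise arithmetic stays on the GPU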