Hello,
while experimenting with GPU support in MATLAB I noticed that it is faster to create an array on the CPU and upload it afterwards with gpuArray than to create it directly with the functions provided by parallel.gpu.GPUArray.
To show this behaviour for different array sizes I've written a small test script, and I've uploaded an image of the most interesting output here. It shows that for arrays smaller than 300,000 elements it is faster to create the array on the CPU and then upload it with gpuArray. I know it is probably bad practice on my part to create many small arrays on the GPU, but the missing gpuArray indexing support in the trial version I tested did not allow me to create one big matrix. Unfortunately the trial is now over. Is this behaviour normal, or is it some kind of bug? To me it looks like the GPUArray functions hit some kind of bottleneck, because CPU usage was at 100% on one core while the test was running.
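For reference, the comparison I'm describing boils down to the following minimal sketch (it assumes a CUDA-capable GPU and the R2010b SP1 Parallel Computing Toolbox; the array size is just an example below the crossover point I observed):

```matlab
% Minimal sketch of the two allocation paths (illustrative only).
n = 1e5;  % ~100k elements, below the ~300,000-element crossover I saw

% Path 1: allocate on the CPU, then upload with gpuArray
tic;
a = gpuArray(zeros(n, 1, 'single'));
tUpload = toc;

% Path 2: allocate directly on the GPU
tic;
b = parallel.gpu.GPUArray.zeros(n, 1, 'single');
tDirect = toc;

fprintf('CPU create + upload: %f s, direct on GPU: %f s\n', tUpload, tDirect);
```

In my runs, path 1 wins for small arrays like this, which is the behaviour I'd like explained.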
-------------------------------------------------------------------------------------
MATLAB Version 7.11.1.866 (R2010b) Service Pack 1
MATLAB License Number: DEMO
Operating System: Microsoft Windows 7 Version 6.1 (Build 7600)
Java VM Version: Java 1.6.0_17-b04 with Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM mixed mode
-------------------------------------------------------------------------------------
MATLAB Version 7.11.1 (R2010bSP1)
Parallel Computing Toolbox Version 5.0 (R2010bSP1)
Signal Processing Toolbox Version 6.14 (R2010bSP1)
- Windows 7
- Nvidia Driver: 275.33
- Cuda Toolkit v4.0
- GPU: GeForce GTX 460 with 1 GB memory (not sure about this at the moment because I can't look it up)
The test script:
MBtoLeave = 10;
dev = gpuDevice();
free = dev.FreeMemory;
free = free - MBtoLeave * 1024^2;
maxPow = floor(log2(free/4));
mallocFuncs = {@(y, x, t) gpuArray(zeros(y, x, t)), ...
    @parallel.gpu.GPUArray.zeros, @parallel.gpu.GPUArray.ones, ...
    @parallel.gpu.GPUArray.nan, @parallel.gpu.GPUArray.inf};
results = {};
for mallocFunc = mallocFuncs
    mallocFunc = mallocFunc{:};
    fprintf('Malloc method: %s\n', func2str(mallocFunc));
    iterations = 2000;
    times = [];
    for pow = 0:maxPow-1
        mallocSize = 2^pow;
        tic;
        for k = 1:iterations
            g = mallocFunc(mallocSize, 1, 'single');
        end
        time = toc;
        mallocsPerSec = iterations/time;
        bytesPerSec = mallocsPerSec * mallocSize * 4;
        fprintf('Malloc size: %d, time: %f, mallocs per second: %f, MB/s: %f\n', ...
            mallocSize, time, mallocsPerSec, bytesPerSec/1024^2);
        if time > 1.5
            iterations = iterations / 2;
        end
        times(end+1, 1:4) = [mallocSize, time, mallocsPerSec, bytesPerSec];
    end
    results(end+1, 1:2) = {mallocFunc, times};
end

mallocsPerSec = cell2mat(cellfun(@(x) x(:, 3), results(:, 2), 'UniformOutput', false)');
mbPerSec = cell2mat(cellfun(@(x) x(:, 4), results(:, 2), 'UniformOutput', false)') / 1024^2;
x = results{1, 2}(:, 1);

semilogx(x, mallocsPerSec);
title('Allocations per second');
legend(cellfun(@func2str, results(:, 1), 'UniformOutput', false));

figure;
semilogx(x, mbPerSec);
legend(cellfun(@func2str, results(:, 1), 'UniformOutput', false));
title('Megabytes per second');

figure;
loglog(x, 1./mbPerSec);
legend(cellfun(@func2str, results(:, 1), 'UniformOutput', false));
title('Time to transfer 1 MB');