MATLAB: PTX kernel time to run – Math Solves Everything

Hello, I am using R2010b and CUDA toolkit 3.1 with a GeForce GT425M. While optimizing my CUDA code, I observed that calling the kernel with feval in MATLAB takes a constant ~2 ms, measured with

    tic; feval(k,…); toc

the kernel code:

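    #define C_WIDTH 1024
    #define C_HEIGHT 768

    __global__ void timetest1(float* holo) {
        // Global 1-D thread index
        int mindex = blockIdx.x * blockDim.x + threadIdx.x;
        int size = C_WIDTH * C_HEIGHT;
        // Threads past the end of the array do nothing
        if (mindex >= size)
            return;
        // Square in float: mindex*mindex overflows int for large indices
        holo[mindex] = (float)mindex * (float)mindex;
    }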

Even if I comment out the write to global memory (the assignment to holo[mindex]), the call still takes ~2 ms.

Does anybody know the origin of this lag? It would be great to somehow eliminate it.

Thanks,

Gaszton

PS: my MATLAB code for calling the kernel:

    clear
    import parallel.gpu.GPUArray

    xsize=1024;
    ysize=768;
    vectorsize=xsize*ysize;
    threadpblock=1024;

    k=parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
    k.ThreadBlockSize=[threadpblock,1,1];
    k.GridSize=[ceil(vectorsize/threadpblock),1];

    dholo=parallel.gpu.GPUArray.zeros(vectorsize,1,'single');

    tic
    [dholo]=feval(k,dholo);
    time=toc;

    ['ms time= ' num2str(time*1000)]
    clear

Best Answer

Firstly, can I suggest that if possible you should upgrade to R2011a, as we have made quite a few performance improvements in that release. Secondly, I think the main bottleneck in your code as written is that, outside a function, an important optimisation called "in-place optimisation" cannot take place. If you place your code inside a function, then "dholo" will not be copied. For reference, I made a function like this:

    function tmp
    import parallel.gpu.GPUArray
    xsize=1024;
    ysize=768;
    vectorsize=xsize*ysize;
    threadpblock=512; % I have a C1060
    k=parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
    k.ThreadBlockSize=[threadpblock,1,1];
    k.GridSize=[ceil(vectorsize/threadpblock),1];
    dholo=parallel.gpu.GPUArray.zeros(vectorsize,1,'single');
    tic
    for ii = 1:1000
        dholo=feval(k,dholo);
    end
    time=toc;
    % Total seconds over 1000 calls equals milliseconds per call
    disp(['ms time= ' num2str(time)])

And the overhead on my C1060 was down to 0.05 ms.
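The timing loop in the answer can be generalised into a small helper that makes the unit conversion explicit. This is only a sketch: the function name `timeKernelCalls` and its arguments `nCalls` and `msPerCall` are hypothetical, and the untimed warm-up call is an added assumption (the first launch may include one-time setup cost), not something the answer does.

```matlab
function msPerCall = timeKernelCalls(k, data, nCalls)
% timeKernelCalls  Hypothetical helper: average wall-clock time per feval call.
%   k      - an already configured parallel.gpu.CUDAKernel object
%   data   - GPU array passed to, and returned from, the kernel
%   nCalls - number of timed kernel calls

% Warm-up call, excluded from the measurement (assumption: the first
% launch may carry one-time setup cost).
data = feval(k, data);

tic
for ii = 1:nCalls
    % Same variable on both sides, inside a function, as in the answer.
    data = feval(k, data);
end
totalSeconds = toc;

% Convert total seconds into milliseconds per call.
msPerCall = 1000 * totalSeconds / nCalls;
```

Assuming `k` and `dholo` are set up as in the question, it would be called as `ms = timeKernelCalls(k, dholo, 1000)`. With `nCalls` equal to 1000 the result is numerically the same as the total time in seconds, which is why the answer can print `time` directly as milliseconds.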
