Hello all, I currently take advantage of a lot of Matlab's GPU-enabled functions (matrix operations, FFTs, etc.), which provide great speed advantages over their CPU counterparts. I thought GPU matrix operations were fast…until I discovered a CPU/GPU/MEX-CUDA comparison for running Conway's Game of Life. On my machine, the MEX-CUDA version was ~50X faster than the GPU version. Or, using the CPU as the baseline:
- CPU – 1X
- GPU – 7X
- MEX-CUDA – 350X
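For context, the plain-gpuArray Life step I was timing looks roughly like the following. This is my own minimal sketch using conv2 to count neighbors, not the demo's actual code, and the grid size, seeding density, and step count are arbitrary:

```matlab
% One Game-of-Life generation per loop iteration, entirely on the GPU.
N      = 1024;
board  = gpuArray(rand(N) > 0.75);                 % random initial population
kernel = gpuArray(single([1 1 1; 1 0 1; 1 1 1]));  % counts the 8 neighbors

for step = 1:100
    neighbors = conv2(single(board), kernel, 'same');
    % A cell is alive next generation with exactly 3 neighbors,
    % or with 2 neighbors if it is already alive.
    board = (neighbors == 3) | (board & neighbors == 2);
end
result = gather(board);                            % copy the final grid back to the host
```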
So, with those kinds of speed gains, I am finally feeling motivated to learn some CUDA. However, what is not clear to me is: when is there a significant speed advantage in writing one's own CUDA kernel? Two narrower questions:
- I choose the GPU over the CPU when I can frame an operation as a binary array operation (matrix additions and the like, NOT sorts). If I meet this criterion, is there a second criterion that tells me I should write my own CUDA code, rather than doing everything inside Matlab and leveraging the built-in GPU-enabled functions via, e.g., arrayfun? (See the first sketch after this list.)
- I assume that the GPU-enabled functions in Matlab, like fft, interp1 (with a linear interpolant), exp, etc., are already as accelerated as they get: I could not write a faster fft myself. Instead, it must be other problems that can be framed as binary array operations (like the stencil update from the Life example) that would require special treatment. Is this true? (See the timing sketch after this list.)
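To make the first question concrete, here is a minimal sketch of the "everything inside Matlab" approach I mean: arrayfun applied to gpuArrays fuses a purely elementwise function into a single GPU kernel launch. The function and array sizes below are made up for illustration:

```matlab
% Elementwise operation fused into one GPU kernel via arrayfun.
x = gpuArray.rand(1e7, 1);
y = gpuArray.rand(1e7, 1);

saturate = @(a, b) tanh(3*a + b.^2);   % purely elementwise, so it qualifies
z = arrayfun(saturate, x, y);          % one fused kernel instead of several

z = gather(z);                         % copy the result back to host memory
```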
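And for the second question, one way to check my assumption (rather than taking it on faith) is to time the built-ins directly with timeit/gputimeit; the transform size here is an arbitrary choice:

```matlab
% Compare CPU and GPU fft on the same data. gputimeit synchronizes the
% device so the GPU timing is fair.
n    = 2^22;
xCpu = rand(n, 1);
xGpu = gpuArray(xCpu);

tCpu = timeit(@() fft(xCpu));
tGpu = gputimeit(@() fft(xGpu));

fprintf('CPU: %.4f s, GPU: %.4f s, speedup: %.1fx\n', tCpu, tGpu, tCpu/tGpu);
```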
Finally, if someone has a nice starting place for CUDA with Matlab, I would appreciate a link; I know a bit of C, but the example files in the Life tutorial, such as pctdemo_life_mex_shmem.cu, are a little outside my current skill set.
Cheers, Dan
Best Answer