MATLAB: GPU memory overhead dependent on fft dimension.

Hello all, I have a question regarding memory management during MATLAB's gpuArray/fft operation. I have a large NxM matrix (approximately N = 10E3, M = 20E3) where I wish to take an fft in the M dimension. Now, for CPU operations I would normally permute the matrix to make the fft operation act in the 1st (column) dimension, for speed.

On the GPU, if I run the fft operation in the 1st dimension, I slam into the memory ceiling of my GPU. However, if I apply it in the row dimension I do not. I assume that this has to do with whether MATLAB is doing N asynchronous FFTs in the row direction vs. a single massive matrix operation in the column dimension.

So, 4 questions:

Is my assumption true?
Are GPU operations still faster in the column direction? (I sort of answered this myself: I got a 3x speed advantage with the snippet below.)
Is there a way to know what the GPU memory need will be for the fft? If so, I can try chunking up the fft based on the GPU memory available.
Is there another implementation that will have the speed of the column operation without the memory issues? I am going to try doing this as an arrayfun just to see.

Code snippet:

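x = gpuArray.rand(10000,10000);   % square test matrix created on the GPU
xp = x.';                         % transposed copy, so both calls transform the same signals
gputimeit(@() fft(x,[],1))        % FFT along dimension 1 (columns)
gputimeit(@() fft(xp,[],2))       % FFT along dimension 2 (rows)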

Thanks all.

Best Answer

MATLAB uses cuFFT, so the behaviour is whatever cuFFT's behaviour is. The implication of the batching API as described by the doc - https://docs.nvidia.com/cuda/cufft/index.html - is that batches that are not contiguous in memory (the row-direction case here) result in multiple kernel launches. This will be slower, but more efficient with memory.

Because the amount of memory an FFT needs is so variable and dependent on signal length, it isn't that valuable to know what the size will be for any particular example. If you're curious you can watch the FreeMemory property output from gpuDevice:

gpu = gpuDevice
gpu.FreeMemory
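
To put a number on it, the property can be sampled before and after a transform. This is only a rough sketch: the matrix size is arbitrary, and it reads AvailableMemory, which I believe is the property name documented in recent releases (an assumption on my part; substitute FreeMemory if that is what your version exposes).

```matlab
% Sketch: estimate how much GPU memory is held after one FFT.
% Assumes Parallel Computing Toolbox and a supported GPU.
gpu = gpuDevice;
x = rand(4096, 4096, 'gpuArray');   % example size, chosen arbitrarily
wait(gpu);                          % let pending GPU work finish first
memBefore = gpu.AvailableMemory;    % assumed property name, see note above

y = fft(x, [], 1);                  % transform along columns
wait(gpu);
memAfter = gpu.AvailableMemory;

% The difference covers the complex output y (twice the bytes of x)
% plus whatever else is still held on the device after the call.
usedBytes = memBefore - memAfter;
fprintf('Memory held after fft: %.1f MB\n', usedBytes / 2^20);
```

This shows what is still held once the call returns, not the peak during the transform, so treat it as a lower bound on what the FFT needs.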

After an FFT the FFT plan is retained so you should see how much memory it took up (as long as it's the first FFT you do in the MATLAB session). For working memory you can assume there will be a copy of the input, possibly two because MATLAB itself will often take a copy of the input in order to ensure your data is not corrupted in the event of an error.
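
If the whole matrix will not fit alongside those copies, one option is to send a block of rows to the GPU at a time. The following is a minimal sketch, not a tuned implementation: chunkedRowFFT and rowsPerChunk are hypothetical names, and it assumes the data starts on the host as a double matrix with one signal per row.

```matlab
function y = chunkedRowFFT(x, rowsPerChunk)
% chunkedRowFFT  FFT along dimension 2, one block of rows at a time.
% Hypothetical helper (not a MATLAB built-in). x is a host matrix;
% only one block is resident on the GPU at any moment.
    [n, m] = size(x);
    y = complex(zeros(n, m));                     % complex result on the host
    for first = 1:rowsPerChunk:n
        last = min(first + rowsPerChunk - 1, n);
        % Transpose so each signal becomes a contiguous column.
        block = gpuArray(x(first:last, :).');
        spec  = fft(block, [], 1);                % column-wise FFT on the GPU
        y(first:last, :) = gather(spec).';        % back to host, original layout
    end
end
```

Saved as chunkedRowFFT.m, it would be called as y = chunkedRowFFT(x, 1000). A starting point for rowsPerChunk is the available memory divided by a few times the size of one complex row (16*M bytes for doubles), but that multiplier is a guess and worth checking against the measured numbers. The host transfers cost time, so this trades speed for a bounded footprint.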

If you can get your signals to be a power of 2 in length (say, 8192) you'll find them much more efficient with memory.
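
As a small illustration, fft can zero-pad to a power-of-2 length for you when given a transform length. The sizes here are made up for the example. Note that zero-padding changes the frequency spacing of the result, so this only helps if a finer frequency grid is acceptable for your analysis.

```matlab
% Sketch: zero-pad each signal to the next power of 2 before the FFT.
x = rand(6000, 256, 'gpuArray');   % 256 signals of length 6000 (example sizes)
nfft = 2^nextpow2(size(x, 1));     % 8192 for a signal length of 6000
y = fft(x, nfft, 1);               % each column is zero-padded to nfft samples
```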
