MATLAB: When using GPU Coder, how can I minimize how often data is transferred between the CPU and GPU?


I am using GPU Coder and am concerned about CPU/GPU data transfer affecting performance. Suppose I have two MATLAB functions with the 'coder.gpu.kernelfun' pragma at the top of each, and I do something with the data between calling them:
A = half(data);
B = kernelfun1(A); % output is B
% do something with B here
C = kernelfun2(B); % input is B
Does the data remain on the GPU the whole time as a half-precision float, or does it get copied to the CPU during the "do something with B" part?

Best Answer

GPU Coder tries to minimize copies between the CPU and GPU. Whether a copy happens depends purely on the data access patterns.
If you generate code for kernelfun1 and kernelfun2 separately (i.e. you call 'codegen' twice), call the generated MEX functions as kernelfun1(b) .* kernelfun2(c), and either function returns a 'half' value, then the results are transferred to the CPU to perform the multiplication. This is a current limitation of MATLAB, because 'gpuArray' does not support the half data type. However, if you do the multiplication in a wrapper function, e.g.:
function a = kernelfun3(b,c)
coder.gpu.kernelfun;
a = kernelfun1(b) .* kernelfun2(c);
end
and only call 'codegen' on kernelfun3, then GPU Coder will generate code such that the multiplication is performed on the GPU.
The limitation above does not apply if kernelfun1 and kernelfun2 return 'single' or some other data type supported by gpuArray. In that case, the following multiplication will be performed on the GPU:
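As a rough sketch, generating a MEX function for the wrapper alone might look like the following. The 1024-by-1 input size and the random test data are assumptions made for illustration, and the sketch assumes the 'half' type is available in your installation and supported by your GPU:

```matlab
% Sketch only: the input size and the test data are illustrative assumptions.
cfg = coder.gpuConfig('mex');            % GPU Coder configuration for a MEX target

b = half(rand(1024,1,'single'));         % example half-precision inputs
c = half(rand(1024,1,'single'));

% Generate code for the wrapper only; kernelfun1 and kernelfun2 are compiled
% as part of kernelfun3, so their results do not cross a MEX boundary.
codegen -config cfg kernelfun3 -args {b, c} -report

a = kernelfun3_mex(b, c);                % default name of the generated MEX function
```

The -report option produces a code generation report, which is a convenient place to check which kernels and memory copies were actually generated.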
kernelfun1(b) .* kernelfun2(c)
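A hedged sketch of the 'single' case, assuming your release supports gpuArray inputs to generated MEX functions (the 1024-by-1 size is again an illustrative assumption):

```matlab
% Sketch only: assumes gpuArray inputs to generated MEX functions are supported
% in your release; the input size is an illustrative assumption.
cfg = coder.gpuConfig('mex');

% Describe a single-precision input that already lives on the GPU.
t = coder.typeof(zeros(1024,1,'single'), 'Gpu', true);

codegen -config cfg kernelfun1 -args {t}
codegen -config cfg kernelfun2 -args {t}

b = gpuArray(rand(1024,1,'single'));
c = gpuArray(rand(1024,1,'single'));

a = kernelfun1_mex(b) .* kernelfun2_mex(c);

% Check where the result lives rather than assuming it.
disp(isa(a, 'gpuArray'))
```

Whether the MEX outputs come back as gpuArray objects can depend on the release and on how each output is computed, so the final check is worth keeping while you experiment.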
GPU Coder tries to fuse kernels as much as possible, so with the example above you may find that the generated code contains a single GPU kernel instead of three separate ones for kernelfun1, kernelfun2 and kernelfun3. The effectiveness of this optimization depends on program structure and dataflow, and we have seen cases where it does not happen. We recommend generating code for your design and inspecting the generated code to see whether the fusion was performed. If not, you can try altering your design to get the desired result.