A few points here. Firstly, (and most importantly), to time code on the GPU, you need to use either gputimeit, or you need to inject a call to wait(gpuDevice) before calling `toc`. That's because work is launched asynchronously on the GPU, and you only get accurate timings by waiting for it to finish. With those minor modifications, on my GPU, I see 0.09 seconds for the `gpuArray` method, and 0.18 seconds for the `arrayfun` version.
Running a loop of GPU operations is generally inefficient, so the main gain you can get here is by pushing the loop inside the `arrayfun` function body so that that loop runs directly on the GPU. Like this:
function x = test_function(x,Nt)
for ii = 1:Nt
x = exp(-1i*(x + abs(x).^2));
end
end
You'll need to invoke it like `A = arrayfun(@test_function, A, Nt)`. On my GPU, this brings the `arrayfun` time down to 0.05 seconds, so about twice as fast as the plain `gpuArray` version.
Best Answer