I have a GPU with compute capability 3.0 in my computer, along with the Parallel Computing Toolbox.
My current code runs significantly faster on the CPU, even without parfor or spmd, than it does on the GPU. You can run the attached code, if you would like to try it.
My question is: how can I make this faster on the GPU, or is a GPU even the right tool for this kind of problem? I have looked at arrayfun and vectorization (I suspect the code is already as vectorized as it is going to get) and have glanced at writing CUDA kernels.
Two primary points:
1. I think CUDA/GPU is made more for a small number of operations on enormous matrices (operating with themselves, such as x=x*x, where size(x) > 1000). But as you can see, my code performs thousands of operations on many different small matrices.
2. In this particular case, only 6 elements of the matrix need to change on each iteration (about 5000 iterations). Everything else stays the same.
Thank you for your help.
%% definitions
gm = 6e6*2*pi;
llimit = -.01;
ulimit = -llimit;
step = 2*ulimit;
p = llimit:step/5000:ulimit;
%% vector
B = ones(256,1);
%% matrix
M = rand(256,256);
% comment for quick disabling of gpu arrays to compare to CPU speed
p = gpuArray(p);
B = gpuArray(B);
M = gpuArray(M);
gm = gpuArray(gm);
C = gpuArray(0);
R = C;
Q = gpuArray.zeros(256,256);
% comment above for quick disable
Delta = p*2*pi*1e6;
tic;
for n = 1:length(p)
    Q(3,3) = -1i*(Delta(n)/2)-gm/2;
    Q(4,4) = 1i*(Delta(n)/2)-gm/2;
    Q(5,5) = -1i*(Delta(n)/2)-gm/2;
    Q(6,6) = 1i*(Delta(n)/2)-gm/2;
    Q(7,7) = -1i*Delta(n);
    Q(8,8) = 1i*Delta(n);
    Md = M+Q;
    C = Md\B;
    R(n) = real(C(2)); % C(2) = excited state pop rho_33
end
toc;
figure;
plot(p, gather(R))
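To make point 2 above concrete, here is a minimal CPU-only sketch of touching just the six diagonal entries that depend on Delta(n), instead of forming the full sum `Md = M + Q` on every pass. The names `idx`, `baseDiag`, and `d` are introduced here for illustration and are not part of the attached code, and the sketch handles a single value of `n` rather than the whole sweep.

```matlab
% Sketch: update only the six diagonal entries that depend on Delta(n).
% Assumption: idx, baseDiag, and d are illustrative names, not from the original code.
gm = 6e6*2*pi;
M = rand(256,256);
Delta = (-0.01:0.02/5000:0.01)*2*pi*1e6;

idx = sub2ind(size(M), 3:8, 3:8);   % linear indices of (3,3) through (8,8)
baseDiag = M(idx);                  % constant part of those six entries (1x6)
Md = complex(M);                    % complex copy of M, allocated once

n = 1;                              % any valid index into Delta
d = [-1i*Delta(n)/2 - gm/2, ...
      1i*Delta(n)/2 - gm/2, ...
     -1i*Delta(n)/2 - gm/2, ...
      1i*Delta(n)/2 - gm/2, ...
     -1i*Delta(n), ...
      1i*Delta(n)];
Md(idx) = baseDiag + d;             % matches Md = M + Q for this n
```

This only removes the 256-by-256 addition from each iteration. The solve `Md\B` is likely still the dominant cost, so the sketch does not by itself show whether the GPU version can be made competitive.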
Best Answer