MATLAB: How to speed up matrix operations with lots of scalar constants involved

Tags: gpu acceleration, matrix array, scalar constants

I have tried transferring all arrays to the GPU, but it is much slower.
dCnE=zeros(nr(jz),npmax(jz),'gpuArray');
dCnB=zeros(nr(jz),npmax(jz),'gpuArray');
dCnZ=zeros(nr(jz),npmax(jz)+1,'gpuArray');
Cn=zeros(nr(jz),ns,'gpuArray');
Cn2=zeros(nr(jz),ns,'gpuArray');
Cn0=Cnw(1:nr(jz),3:5:nsw-2); %Cnw is a gpu array.
CnE=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnB=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnZ=zeros(nr(jz),ns,npmax(jz)+1,'gpuArray');
CnZ(:,:,1)=Cnw(1:nr(jz),3:5:nsw-2);
CnZ=permute(CnZ,[1,3,2]);
I=I0*(exp(-((gpuArray.linspace(0,dr*(nr(jz)-1),nr(jz))/rz).^2)))'*exp(-((gpuArray.linspace(-t0,t0,nt+1)/tau).^2));
% I think dI, dCn, dCn2 and dCn0 are completely overwritten in each iteration, so they do not need to be preallocated.
for jl=1:ns
    for jt2=1:(nt+1)
        dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
        dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
            +vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;
        dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
            -Cn2(:,jl)*(1-exp(-dt/tau2));
        dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
        if (npmax(jz)>nphth) % The most time-consuming part
            Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
            dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
                vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
                +vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
            dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
                -CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
            dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
            dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
                -CnE(:,1,jl)/tau1*dt;
            dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
            CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
            CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
            CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
        end
        I(:,jt2)=I(:,jt2)+dI;
        non0=I(:,jt2);
        non0(non0<0)=0;
        I(:,jt2)=non0;
        Cn(:,jl)=Cn(:,jl)+dCn;
        Cn0(:,jl)=Cn0(:,jl)+dCn0;
        Cn2(:,jl)=Cn2(:,jl)+dCn2;
        non0=Cn(:,jl);
        non0(non0<0)=0;
        Cn(:,jl)=non0;
        non0=Cn0(:,jl);
        non0(non0<0)=0;
        Cn0(:,jl)=non0;
    end
end
Compared to my other GPU-accelerated code, I think the difference is that many more scalar constants are involved in this case, so I transferred them to the GPU too. But it is still slower than the version without GPU computing. Maybe the reason is that nr(jz) only varies from 23 to 171, which is too small. Is my reasoning correct? Is there any way to accelerate this script?
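For reference, here is a minimal timing sketch (timeit and gputimeit are standard functions; the size 171 is just the upper end of my nr(jz) range) that could be used to check whether arrays this small benefit from the GPU at all:
n    = 171;                          % placeholder for nr(jz)
x    = rand(n, 1);
xg   = gpuArray(x);
tCPU = timeit(@() exp(-x.^2) .* x);
tGPU = gputimeit(@() exp(-xg.^2) .* xg);
fprintf('CPU: %.3g s, GPU: %.3g s\n', tCPU, tGPU);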

Best Answer

If you have a modern MATLAB version (R2016b or newer), use implicit expansion (auto-expanding) instead of an explicit repmat:
% Ipc = repmat(I(:,jt2),[1,npmax(jz)-1]);
Ipc = I(:,jt2);
... CnE(:,1:npmax(jz)-1,jl)) .* Ipc * dt / hb
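A minimal standalone sketch of implicit expansion (the sizes here are made-up placeholders, not taken from the question):
A = rand(171, 40);      % e.g. nr(jz)-by-(npmax(jz)-1)
v = rand(171, 1);       % column vector, like I(:,jt2)
B = A .* v;             % v is expanded along the 2nd dimension automatically
% Equivalent to, but cheaper than:
% B = A .* repmat(v, [1, size(A, 2)]);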
I guess that dt and hb are scalars (please do not let the readers guess!). Then remember that MATLAB evaluates the code from left to right:
Y = X * dt / hb
is equivalent to:
Temp = X * dt;
Y = Temp / hb;
These are 2 matrix operations. But you can combine the scalar operations:
Y = X * (dt / hb)
This is 1 matrix operation and a cheap scalar division.
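A rough way to see the difference (timeit is a standard function; the matrix size and scalar values are placeholders):
X  = rand(171, 40);
dt = 1e-3;  hb = 2.5;                 % made-up scalars
t1 = timeit(@() X * dt / hb);         % two full-size matrix operations
t2 = timeit(@() X * (dt / hb));       % one matrix operation and one scalar division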
As Matt J has explained already, operating on subarrays as in dCnE(:,2:npmax(jz)) is expensive. Better to crop the data so that you can omit the ugly indexing.
I cannot run it on my own, because you did not provide the input values, but compare the two versions visually:
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+ ...
vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
or:
c1 = 1 - exp(-dt / tau2); % Calculate constants outside the loops!
Ipc = I(:,jt2);
dCnE = (sigma0 * CnZ(:, :, jl) - sigma * CnE(:, :, jl) + ...
        vs2 * sigma * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
        CnB(:,:,jl) * c1 + ...
        (vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
        CnE(:, :, jl) * (dt / tau1);
This can be simplified a little bit for the CnE term:
dCnE = (sigma0 * CnZ(:, :, jl) + ...
        ((vs2 - 1) * sigma) * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
        CnB(:,:,jl) * c1 + ...
        (vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
        CnE(:, :, jl) * (dt / tau1);
Methods:
  1. Combine the scalar operations to reduce the number of matrix operations.
  2. Move constants out of the loop.
  3. Crop the data outside the loops instead of applying a:b indexing in each iteration (see the sketch after this list).
  4. Use auto-expanding instead of inflating by repmat. bsxfun is sometimes faster and sometimes slower than auto-expanding, so try both.
  5. Avoid repeated calculations. In your case, e.g., Ipc(:,1)*dt/hb is computed several times. Computing it once avoids this:
C1 = I(:, jt2) * (dt/hb);
The latter includes another idea: you first inflate I(:,jt2) with Ipc = repmat(...) and then waste time cropping out the subvector Ipc(:,1) later on. Better to create the vector once only: Ipc = I(:,jt2).
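A generic sketch of method 3, cropping once before the loop instead of indexing in every iteration (all names and sizes here are placeholders, not from the original code):
n = 40;  m = 100;  c = 0.5;       % placeholder sizes and constant
A = rand(171, n);
B = zeros(171, n - 1);
Asub = A(:, 2:n);                 % crop once, outside the loop
for k = 1:m
    B = B + c * Asub;             % no a:b indexing inside the loop
end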
Finally: Replace
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
by:
Cn0 = max(Cn0, 0);
and move it outside the loops, because you do not use the cleaned values inside the loop.
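The same replacement works for the other clipped arrays. A sketch of the clean-up after both loops (only valid if, as stated above, the clipped values really are not needed inside the loops):
Cn  = max(Cn,  0);
Cn0 = max(Cn0, 0);
I   = max(I,   0);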