MATLAB: How to speed up matrix operations with lots of scalar constants involved

Tags: gpu acceleration, matrix array, scalar constants

I have tried transferring all arrays to the GPU, but it is much slower.
dCnE=zeros(nr(jz),npmax(jz),'gpuArray');
dCnB=zeros(nr(jz),npmax(jz),'gpuArray');
dCnZ=zeros(nr(jz),npmax(jz)+1,'gpuArray');
Cn=zeros(nr(jz),ns,'gpuArray');
Cn2=zeros(nr(jz),ns,'gpuArray');
Cn0=Cnw(1:nr(jz),3:5:nsw-2); %Cnw is a gpu array.
CnE=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnB=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnZ=zeros(nr(jz),ns,npmax(jz)+1,'gpuArray');
CnZ(:,:,1)=Cnw(1:nr(jz),3:5:nsw-2);
CnZ=permute(CnZ,[1,3,2]);
I=I0*(exp(-((gpuArray.linspace(0,dr*(nr(jz)-1),nr(jz))/rz).^2)))'*exp(-((gpuArray.linspace(-t0,t0,nt+1)/tau).^2));
% I think dI, dCn, dCn2 and dCn0 are completely overwritten in each iteration, so they do not need to be preallocated.
for jl=1:ns
    for jt2=1:(nt+1)
        dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
        dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
            +vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;
        dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
            -Cn2(:,jl)*(1-exp(-dt/tau2));
        dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
        if (npmax(jz)>nphth) % The most time-consuming part
            Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
            dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
                vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
                +vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
            dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
                -CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
            dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
            dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
                -CnE(:,1,jl)/tau1*dt;
            dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
            CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
            CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
            CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
        end
        I(:,jt2)=I(:,jt2)+dI;
        non0=I(:,jt2);
        non0(non0<0)=0;
        I(:,jt2)=non0;
        Cn(:,jl)=Cn(:,jl)+dCn;
        Cn0(:,jl)=Cn0(:,jl)+dCn0;
        Cn2(:,jl)=Cn2(:,jl)+dCn2;
        non0=Cn(:,jl);
        non0(non0<0)=0;
        Cn(:,jl)=non0;
        non0=Cn0(:,jl);
        non0(non0<0)=0;
        Cn0(:,jl)=non0;
    end
end
Compared to my other GPU-accelerated code, I think the difference is that many more scalar constants are involved in this case, so I transferred them to the GPU too. But it is still slower than the version without GPU computing. Maybe the reason is that nr(jz) only varies from 23 to 171, which is too small. Is my reasoning correct? Is there any way to accelerate this script?
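For reference, here is a minimal timing sketch (timeit and gputimeit are standard functions; the size 171 is just the upper end of my nr(jz) range) that could be used to check whether arrays this small benefit from the GPU at all:
n    = 171;                          % placeholder for nr(jz)
x    = rand(n, 1);
xg   = gpuArray(x);
tCPU = timeit(@() exp(-x.^2) .* x);
tGPU = gputimeit(@() exp(-xg.^2) .* xg);
fprintf('CPU: %.3g s, GPU: %.3g s\n', tCPU, tGPU);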

Best Answer

If you have a modern MATLAB version (R2016b or newer), use implicit expansion (auto-expanding) instead of an explicit repmat:
% Ipc = repmat(I(:,jt2),[1,npmax(jz)-1]);
Ipc = I(:,jt2);
... CnE(:,1:npmax(jz)-1,jl)) .* Ipc * dt / hb
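A minimal standalone sketch of implicit expansion (the sizes here are made-up placeholders, not taken from the question):
A = rand(171, 40);      % e.g. nr(jz)-by-(npmax(jz)-1)
v = rand(171, 1);       % column vector, like I(:,jt2)
B = A .* v;             % v is expanded along the 2nd dimension automatically
% Equivalent to, but cheaper than:
% B = A .* repmat(v, [1, size(A, 2)]);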
I guess that dt and hb are scalars (please do not let the readers guess!). Then remember that MATLAB evaluates the code from left to right:
Y = X * dt / hb
is equivalent to:
Temp = X * dt;
Y = Temp / hb;
These are 2 matrix operations. But you can combine the scalar operations:
Y = X * (dt / hb)
This is 1 matrix operation and a cheap scalar division.
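A rough way to see the difference (timeit is a standard function; the matrix size and scalar values are placeholders):
X  = rand(171, 40);
dt = 1e-3;  hb = 2.5;                 % made-up scalars
t1 = timeit(@() X * dt / hb);         % two full-size matrix operations
t2 = timeit(@() X * (dt / hb));       % one matrix operation and one scalar division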
As Matt J has explained already, operating on subarrays as in dCnE(:,2:npmax(jz)) is expensive. Better to crop the data so that you can omit the ugly indexing.
I cannot run it on my own, because you did not provide the input values, but compare the two versions visually:
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+ ...
vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
or:
c1 = 1 - exp(-dt / tau2); % Calculate constants outside the loops!
Ipc = I(:,jt2);
dCnE = (sigma0 * CnZ(:, :, jl) - sigma * CnE(:, :, jl) + ...
        vs2 * sigma * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
        CnB(:,:,jl) * c1 + ...
        (vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
        CnE(:, :, jl) * (dt / tau1);
This can be simplified a little bit for the CnE term:
dCnE = (sigma0 * CnZ(:, :, jl) + ...
        ((vs2 - 1) * sigma) * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
        CnB(:,:,jl) * c1 + ...
        (vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
        CnE(:, :, jl) * (dt / tau1);
Methods:
  1. Combine the scalar operations to reduce the number of matrix operations.
  2. Move constants out of the loop.
  3. Crop the data outside the loops instead of applying a:b indexing in each iteration (see the sketch after this list).
  4. Use auto-expanding instead of inflating by repmat. bsxfun is sometimes faster and sometimes slower than auto-expanding, so try both.
  5. Avoid repeated calculations. In your case, e.g., Ipc(:,1)*dt/hb is computed several times. Computing it once avoids this:
C1 = I(:, jt2) * (dt/hb);
The latter includes another idea: you first inflate I(:,jt2) with Ipc = repmat(...) and then waste time cropping out the subvector Ipc(:,1) later on. Better to create the vector once only: Ipc = I(:,jt2).
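A generic sketch of method 3, cropping once before the loop instead of indexing in every iteration (all names and sizes here are placeholders, not from the original code):
n = 40;  m = 100;  c = 0.5;       % placeholder sizes and constant
A = rand(171, n);
B = zeros(171, n - 1);
Asub = A(:, 2:n);                 % crop once, outside the loop
for k = 1:m
    B = B + c * Asub;             % no a:b indexing inside the loop
end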
Finally: Replace
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
by:
Cn0 = max(Cn0, 0);
and move it outside the loops, because you do not use the cleaned values inside the loop.
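The same replacement works for the other clipped arrays. A sketch of the clean-up after both loops (only valid if, as stated above, the clipped values really are not needed inside the loops):
Cn  = max(Cn,  0);
Cn0 = max(Cn0, 0);
I   = max(I,   0);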