I am trying to optimize the following generic matrix operation:
m = 3; % small number in general
n = 2^20; % large power of 2 in general
A = rand(m,n);
B = zeros(m^2,m^2);
for ii = 1:size(A,2)
    a = A(:,ii);
    r = a*a';
    B = B + kron(r,r);
end
% return B
On my computer the above takes ~7s. By compiling to a MEX file with MATLAB Coder I can improve this by ~15x. I have tried compiling to CUDA with GPU Coder, but this seems to be quite inefficient.
I think the difficulty comes from two different sources:
1) I am not sure of an efficient way to vectorize the creation of the "r" matrices from the columns of A, so I have to resort to the outer for-loop approach.
2) I think the Kronecker product is inefficient to implement on the GPU because the matrices are so small.
The speedup from compiling to MEX is nice, but I have a feeling that I am still doing something quite inefficiently. I would appreciate any ideas on how to optimize the above calculation, either along the lines of the two difficulties outlined above or via a different approach.
Best Answer
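One possible direction, offered as a minimal sketch rather than a definitive implementation: by the mixed-product property of the Kronecker product, kron(a*a', a*a') equals kron(a,a)*kron(a,a)'. If the columns kron(A(:,k), A(:,k)) are stacked into an m^2-by-n matrix (called C here, a name introduced only for this illustration), the whole sum collapses to the single matrix product C*C'. The sketch assumes MATLAB R2016b or later for implicit expansion, and it uses a reduced n (an assumption, not the 2^20 from the question) so the loop-based check finishes quickly.

```matlab
m = 3;        % small dimension, as in the question
n = 2^12;     % assumption: reduced from 2^20 so the reference loop is quick
A = rand(m, n);

% Column-wise Kronecker product of A with itself.
% P(j,i,k) = A(j,k)*A(i,k); flattening the first two dimensions puts
% A(i,k)*A(j,k) at row (i-1)*m + j, which matches kron(A(:,k), A(:,k)).
P = reshape(A, 1, m, n) .* reshape(A, m, 1, n);   % m-by-m-by-n
C = reshape(P, m^2, n);                           % m^2-by-n

% Sum over k of kron(a,a)*kron(a,a)' as one matrix multiply.
B = C * C';

% Reference: the original loop, used only to check the result.
Bref = zeros(m^2);
for k = 1:n
    r = A(:,k) * A(:,k)';
    Bref = Bref + kron(r, r);
end
relErr = norm(B - Bref, 'fro') / norm(Bref, 'fro');
fprintf('relative error vs. loop: %g\n', relErr);
```

C holds m^2*n doubles, which is about 75 MB for m = 3 and n = 2^20. If that is too much, the columns of A can be processed in chunks and the partial products C*C' accumulated. Because the work reduces to one dense matrix multiply, it may also suit gpuArray inputs better than many small kron calls, though that is untested here.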