MATLAB: Vectorizing nonlinear matrix operation on many small matrices

I am trying to optimize the following generic matrix operation:
m = 3;      % small number in general
n = 2^20;   % large power of 2 in general
A = rand(m,n);
B = zeros(m^2,m^2);
for ii = 1:size(A,2)
    a = A(:,ii);
    r = a*a';               % m-by-m outer product of column ii with itself
    B = B + kron(r,r);      % accumulate the m^2-by-m^2 Kronecker products
end
% return B
On my computer the above takes ~7 s. Compiling it to a MEX file with MATLAB Coder speeds it up by ~15x. I have also tried compiling to CUDA with GPU Coder, but the result seems to be quite inefficient.
I think the difficulty comes from two sources:
1) I do not know an efficient way to vectorize the creation of the "r" matrices from the columns of A, so I have to resort to the outer for loop.
2) I think the Kronecker product is inefficient on the GPU because the matrices are so small.
The speedup from compiling to MEX is nice, but I have the feeling that I am still doing something quite inefficiently. I would appreciate any ideas on how to optimize the above calculation, either along the lines of the two difficulties outlined above or via a different approach.

Best Answer

m = 3;      % small number in general
n = 2^20;   % large power of 2 in general
A = rand(m,n);
B = zeros(m^2,m^2);
tic
for ii = 1:size(A,2)        % baseline: the loop from the question
    a = A(:,ii);
    r = a*a';
    B = B + kron(r,r);
end
toc
Elapsed time is 6.800329 seconds.
Elapsed time is 0.081757 seconds.
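
The first timing matches the loop above. The second presumably belongs to a loop-free version whose code is not preserved here. As a minimal sketch of one way to remove the loop (not necessarily the approach that produced the timing above): by the mixed-product property of the Kronecker product, kron(a*a',a*a') equals kron(a,a)*kron(a,a)', so the whole sum collapses to a single product K*K', where column ii of K is kron(A(:,ii),A(:,ii)). The names K and B_vec are illustrative, and the element-wise product assumes implicit expansion (R2016b or later).

```matlab
% Sketch only: K and B_vec are illustrative names, not from the original answer.
m = 3;      % small number in general
n = 2^20;   % large power of 2 in general
A = rand(m,n);

tic
% Build all kron(a,a) vectors at once. The element-wise product of an
% m-by-1-by-n and a 1-by-m-by-n array expands to m-by-m-by-n, holding the
% outer product a*a' of every column; reshaping stacks each one into a column.
K = reshape(reshape(A,m,1,n) .* reshape(A,1,m,n), m^2, n);
% Summing K(:,ii)*K(:,ii)' over all ii is a single matrix product.
B_vec = K*K';
toc
```

K needs m^2*n doubles (about 75 MB for the sizes above), so for much larger m or n it may be worth processing A in column blocks and accumulating the partial products. Wrapping A in gpuArray before these two lines should also work if Parallel Computing Toolbox is available, though I have not timed that.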