For 1000x1000 matrices, this is already about 6 times faster:
n = size(X, 1);
X = X.';   % transpose so that each point is a column (column-major access)
Y = Y.';
dx = zeros(n, n);
dy = zeros(n, n);
for j = 1:n
    Xj = X(:, j);
    Yj = Y(:, j);
    for i = j+1:n
        dx(i, j) = sqrt(sum(bsxfun(@minus, X(:, i), Xj) .^ 2));
        dy(i, j) = sqrt(sum(bsxfun(@minus, Y(:, i), Yj) .^ 2));
        dx(j, i) = dx(i, j);   % the distance matrix is symmetric
        dy(j, i) = dy(i, j);
    end
end
dz = max(dx, dy);
The original function took 29.5 sec (R2016b, Core2Duo, Win7/64), and the cleaned version 5.2 sec.
Here the data are processed columnwise, which is much faster because neighboring elements of a column are adjacent in memory. The comparison by max() is done once, outside the loops. Finally, the resulting matrix is symmetric, so each distance needs to be computed only once and can be copied to the mirrored element.
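As a minimal illustration of the memory-layout effect (timings are machine-dependent): MATLAB stores matrices column-major, so a column A(:, j) is one contiguous block, while a row A(i, :) is strided across memory:

```matlab
A = rand(4000);
tic
for j = 1:size(A, 2)
    c = A(:, j);    % contiguous read: fast
end
tColumn = toc;
tic
for i = 1:size(A, 1)
    r = A(i, :);    % strided read: slower
end
tRow = toc;
```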
I tried to vectorize the inner loop:
n = size(X, 1);
X = X.';
Y = Y.';
dx = zeros(n, n);
dy = zeros(n, n);
for j = 1:n
    dx(j+1:n, j) = sqrt(sum(bsxfun(@minus, X(:, j+1:n), X(:, j)) .^ 2, 1));
    dy(j+1:n, j) = sqrt(sum(bsxfun(@minus, Y(:, j+1:n), Y(:, j)) .^ 2, 1));
    dx(j, j+1:n) = dx(j+1:n, j);
    dy(j, j+1:n) = dy(j+1:n, j);
end
dz = max(dx, dy);
This takes 21 sec for 1000x1000 arrays, but for smaller 100x100 inputs it is faster: 1.2 sec instead of 2.2 sec (100 iterations).
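If the Statistics and Machine Learning Toolbox is available (an assumption, not stated in the question), the same pairwise Euclidean distances can be obtained without an explicit loop. Note that pdist treats each row as one point, so it is applied to the data before the transposition used above:

```matlab
% Sketch assuming the Statistics and Machine Learning Toolbox.
% pdist expects points in rows; squareform expands the compact
% distance vector to the full symmetric n-by-n matrix.
dx = squareform(pdist(X));
dy = squareform(pdist(Y));
dz = max(dx, dy);
```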
Now you have an efficient function as a starting point for parallelization or computation on the GPU. Maybe this is useful (I cannot test it):
% parfor requires output variables to be sliced by the loop variable,
% so collect the two distance matrices in a cell array:
XY = {X, Y};
D  = cell(1, 2);
parfor v = 1:2
    P = XY{v};
    d = zeros(n, n);
    for j = 1:n
        d(j+1:n, j) = sqrt(sum((P(:, j+1:n) - P(:, j)) .^ 2, 1));
        d(j, j+1:n) = d(j+1:n, j);
    end
    D{v} = d;
end
dx = D{1};
dy = D{2};
But a parfor over the inner j loop would use more cores.
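For the GPU path, a rough sketch (untested; assumes the Parallel Computing Toolbox and a supported GPU) is to build the whole distance matrix from one matrix product, using the identity |a - b|^2 = |a|^2 + |b|^2 - 2*a.'*b:

```matlab
% GPU sketch, assuming the Parallel Computing Toolbox; points in rows of X.
Xg = gpuArray(X);
s  = sum(Xg .^ 2, 2);                  % n-by-1 squared norms
G  = Xg * Xg.';                        % Gram matrix, n-by-n
d2 = bsxfun(@plus, s, s.') - 2 * G;    % squared pairwise distances
dx = gather(sqrt(max(d2, 0)));         % clamp tiny negative rounding errors
```

The same applies to Y, followed by dz = max(dx, dy). Be aware that this formulation trades accuracy for speed: for nearly coincident points the subtraction cancels and the result loses precision compared to the direct sqrt(sum((a-b).^2)) form.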