MATLAB: Fastest match between 2 sequences

Hi All,

I am matching many long sequences against each other in a way that the shorter sequence x slides against the longer reference sequence r in a search for the best match: the number of matching symbols (characters).

Here is my naive looped code that finds the best match. Do you have any suggestions on how to turn it into the fastest possible (vectorized) code? There might be a number of same sized shorter vectors x coming as inputs, so if possible to exploit it in the vectorization it would be ideal. I know this could be done with convolution on the binary vectors but my vectors are character symbols. Any ideas would be much appreciated.

%Finds the best match between shorter sequence x sliding against a longer reference sequence r
%The slide starts from the last element x on top of the 1st element of r:
%start (i=-2)    'THE'->
%                  'ABCFGHETYUI'
%best match (i=4):     'THE'
%match score s is the number of elements in x that match (align) with r.
%match position i could be from -m+1:n+m-2 i.e. takes n+m-1 possible values
function [i,s]=find_seq(x,r)
m=numel(x); n=numel(r);
for i=1:m-1                             %Counts matches in the left partial overlap 
    s(i)=sum(x(end-i+1:end)==r(1:i));
end
for i=1:n-m+1                           %Counts matches when x fully overlaps with r
    s(i+m-1)=sum(x==r(i:i+m-1));
end
for i=1:m-1                             %Counts matches in the right partial overlap
    s(n+i)=sum(x(1:m-i)==r(n-m+i+1:end));
end
[s,i]=max(s); i=i-m+1;

function [i, s] = find_seq_v3(x, r) m = numel(x); n = numel(r); s = zeros(1, n + m - 1); for i = 1:m-1 % left partial overlap s(i) = sum(x(m-i+1:m) == r(1:i)); end % Original: Long loop over r: % for i = 1:n-m+1 % fully overlaps with r % s(i+m-1) = sum(x == r(i:i+m-1)); % end % Faster: Short loop over x: q = 0; % full overlap for i = 1:m q = q + (x(i) == r(i:n-m+i)); end s(m:n) = q; for i = 1:m-1 % right partial overlap s(n+i) = sum(x(1:m-i) == r(n-m+i+1:n)); end [s,i] = max(s); i = i - m + 1; end

Best Answer

Another approach:

Speed comparison:

Elapsed time is 2.224595 seconds. Original
Elapsed time is 0.802787 seconds. Pre-allocation & UINT8
Elapsed time is 0.184660 seconds. Pre-allocation & UINT8 & Short loop

Just 8% of the original runtime. The FFT method of the paper might be magically fast. But some small changes can accelerate the naive implementation remarkably already.

I'm going to mex this tomorrow.

Best Answer

Related Solutions

MATLAB: Is this so slow

MATLAB: Can someone explain this large difference in compute time

Related Question