Hi All,
I am matching many long sequences against each other: a shorter sequence x slides along a longer reference sequence r, and I am looking for the position with the best match, i.e. the largest number of matching symbols (characters).
Here is my naive loop-based code that finds the best match. Do you have any suggestions on how to turn it into the fastest possible (vectorized) code? There may be several shorter vectors x of the same size coming in as inputs, so it would be ideal if the vectorization could exploit that. I know this could be done with convolution on binary vectors, but my vectors contain character symbols. Any ideas would be much appreciated.
% Finds the best match between a shorter sequence x sliding against a longer reference sequence r.
% The slide starts with the last element of x on top of the 1st element of r:
% start (i=-2): 'THE' ->
%                 'ABCFGHETYUI'
% best match (i=4):   'THE'
% The match score s is the number of elements in x that match (align) with r.
% The match position i ranges over -m+1:n-1, i.e. takes n+m-1 possible values.
function [i,s]=find_seq(x,r)
m=numel(x); n=numel(r);
s=zeros(1,n+m-1); % Preallocates one score per slide position
for i=1:m-1 % Counts matches in the left partial overlap
    s(i)=sum(x(end-i+1:end)==r(1:i));
end
for i=1:n-m+1 % Counts matches when x fully overlaps with r
    s(i+m-1)=sum(x==r(i:i+m-1));
end
for i=1:m-1 % Counts matches in the right partial overlap
    s(n+i)=sum(x(1:m-i)==r(n-m+i+1:end));
end
[s,i]=max(s);
i=i-m; % Index k of s corresponds to position k-m, so the start is i=-m+1 as in the example above
Best Answer
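One possible direction, offered as a sketch rather than a definitive solution: the convolution idea from the question can still be used with character data by building a 0/1 indicator vector per symbol and summing the per-symbol correlations. The function name find_seq_conv and the convention that several same-sized sequences are stacked as rows of a matrix X are assumptions made for this illustration. The returned i uses the same convention as the example in the question (i=-2 at the start and i=4 for 'THE').

```matlab
function [i, s] = find_seq_conv(X, r)
% FIND_SEQ_CONV  Hypothetical vectorized variant of find_seq (name is an assumption).
%   X : K-by-m char (or numeric) matrix, one short sequence per row
%   r : reference sequence, numel(r) >= m
%   i : K-by-1 best positions, in the range -m+1 ... n-1
%   s : K-by-1 best match counts
r = r(:).';                          % force a row vector
m = size(X, 2);
n = numel(r);
scores = zeros(size(X, 1), n + m - 1);
symbols = intersect(X(:), r(:));     % only symbols present in both can match
for c = symbols(:).'                 % loop over the alphabet, not over positions
    % For one symbol, the match count at every position is a correlation of
    % 0/1 indicator vectors; conv2 handles all rows of X in a single call.
    scores = scores + conv2(double(r == c), double(fliplr(X == c)));
end
[s, k] = max(scores, [], 2);         % first maximum per row, as in the loop version
i = k - m;                           % column index k corresponds to position k-m
end
```

For example, [i,s] = find_seq_conv(['THE';'ABC'],'ABCFGHETYUI') scores both short sequences against the same reference in one call. The remaining loop runs over the distinct symbols only, so this should pay off mainly when the alphabet is small compared with the sequence lengths; whether it beats the original loops for a given data set is worth timing.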