Hummm, interesting problem. I would approach this slightly differently, taking advantage of two features to try and achieve some reasonably efficient solution:
- The requested substring length num is small and fixed.
- MATLAB'S inbuilt functions are highly optimized and quite efficient.
Given those points, I would simply loop over the string num times: 1st iteration start from position 1, 2nd iteration starting from position 2, etc. Within each iteration you can reshape the largest possible subset of the main string into a matrix and easily find the unique rows. This intermediate step slightly reduces memory requirements and builds up a "semi-unique" set of all substrings of the requested length. Also collect their indices and use histc to count the substring occurrences for that iteration. Then after the loop use efficient unique again to get the final list of unique substrings, together with accumarray to add the histc output counts together.
str = 'ZXCVBNMZXCVBAS';
num = 4;
tot = numel(str);
tmp = cell(1,num);
idh = cell(1,num);
for k = 1:num
vec = k-1:num:tot;
[tmp{k},~,idt] = unique(reshape(str(k:vec(end)),num,[]).','rows');
idh{k} = histc(idt,unique(idt));
end
[out,~,ido] = unique(vertcat(tmp{:}),'rows');
cnt = accumarray(ido,vertcat(idh{:}));
For that small sample string it returns:
>> out
out =
BNMZ
CVBA
CVBN
MZXC
NMZX
VBAS
VBNM
XCVB
ZXCV
>> cnt
cnt =
1
1
1
1
1
1
1
2
2
which seems to be correct. For the longer string (1e7 elements) it took my laptop
Elapsed time is 9.031 seconds.
to run, and I would imagine it scales roughly linearly with num. How does that compare to the methods you are using now?
Best Answer