MATLAB: How to extract text from string at the same location, one line above

Tags: MATLAB, performance, search, speed, strfind, string

I have a variable number of text files (between 3 and 8), each between 20,000 and 30,000 lines long (the lengths differ), and around 400 words to search for. The words have different lengths.

Let's say I have the following text:

xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx12345xxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

where xxxxx can be anything other than what I want to search for. I want to check whether the following are true:

That each text file includes '12345'
That for at least one occurrence of '12345' in each file, there is a '999' on the line directly above it. The end of '999' always coincides with the end of '12345', i.e. both end in the same column.

I can determine whether '12345' is in each of the text files using strfind, but strfind only outputs an "index" value for the first character of my search pattern (e.g. 613587). Is there a way to find the line number that this "index" value corresponds to, and search one line above for '999'?

I think I saw people recommending that each line for each file be read as a separate string, then search each string independently, but that seems like a lot of work for MATLAB to go through, having to generate close to a hundred thousand strings. Is there a better/more efficient way of achieving this?

Any help would be appreciated!

Best Answer

"Is there a better/more efficient way of achieving this?" No, I don't think so. However, speed depends on how "each line for each file be read as a separate string" is done. (Are strings in an array separate?)

"that seems like a lot of work for MATLAB" Don't guess and don't rely on hearsay. Make a simple test.

I assume that your example is oversimplified and that the script below won't work with the actual files. However, it might help you to estimate execution times.

I made a test file, cssm.txt, with 30,000 lines by copying and modifying lines from your question. It contains only one pair

xxxxxxxxxxxxxxxxxxxxxxxxxx999xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx12345xxx

which is at line 15001.

The script below contains fileread(); strfind(chr,'12345') for comparison, followed by two independent solutions. The elapsed times for the three cases, in that order, are

Elapsed time is 0.006313 seconds.
Elapsed time is 0.053587 seconds.
Elapsed time is 0.020818 seconds.

on an ordinary desktop running R2018b. The execution time of the second solution is less than four times that of fileread(); strfind();. Eight files and four hundred words should take a bit more than one minute to process (8*400*0.02 = 64 seconds). During my test the text file was somewhere in the cache system. The execution time will depend (a little) on whether you have an SSD or a spinning disk.

%% Baseline for comparison: read the whole file and run strfind once
%#ok<*NASGU>
tic 
chr = fileread('cssm.txt'); 
pos = strfind( chr, '12345' );
toc
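%% Solution 1: regexp on every line, then compare end positions
tic 
fid = fopen('cssm.txt','rt');
cac = textscan( fid, '%s', 'Delimiter','\n' );
str = reshape( string( cac{1} ), 1,[] );        % one string per line, as a row
fclose( fid );
%%
e1  = regexp( str, "999", 'end', 'once' );      % end column of the first '999' on each line
e2  = regexp( str, "12345", 'end', 'once' );    % end column of the first '12345' on each line
is1 = not( cellfun( 'isempty', e1 ) );
is2 = not( cellfun( 'isempty', e2 ) );
%%
% every line with '999' that is directly followed by a line with '12345'
% (all candidates are kept so that the loop can test each of them)
pos = find( is1 & [ is2(2:end), false ] ); 
%
found = false;              
for p = reshape( pos, 1,[] )
    if e1{p}==e2{p+1}
        found = true;
        break
    end
end
toc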
%% Solution 2: contains() to pick candidate lines, regexp only on those
tic 
fid = fopen('cssm.txt','rt');
cac = textscan( fid, '%s', 'Delimiter','\n' );
str = reshape( string( cac{1} ), 1,[] );        % one string per line, as a row
fclose( fid );
%%
is1 = contains( str, "999" );
is2 = contains( str, "12345" );
% every line with '999' that is directly followed by a line with '12345'
pos = find( is1 & [ is2(2:end), false ] ); 
%
found = false;
for p = reshape( pos, 1,[] )
    if regexp(str(p),"999",'end','once') == regexp(str(p+1),"12345",'end','once')
        found = true;
        break
    end
end
toc

(There are edge cases for which this script will throw errors.)
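
As for the original question of mapping a strfind index to a line number: one possible approach, which is not part of the answer above and was not timed, is to locate the newline characters once and count how many of them precede each hit. The sketch below assumes the whole file fits in memory as one char vector and that lines end with a newline character. The sample text and the names patLow, patUp, nl and first are made up for illustration.

```matlab
% Hypothetical sample text; in practice use chr = fileread('cssm.txt');
chr = sprintf('%s\n', 'aaaaaaaa', 'bbbbbb999bbbb', 'cccc12345ccc', 'dddd');
patLow = '12345';                      % pattern on the lower line
patUp  = '999';                        % pattern expected one line above

nl    = find(chr == newline);          % positions of all line breaks
first = [1, nl + 1];                   % index of the first character of each line
hits  = strfind(chr, patLow);          % start indices of patLow in the whole text

found = false;
for k = hits
    lineNo = nnz(nl < k) + 1;          % line number containing this hit
    if lineNo == 1, continue, end      % there is no line above the first line
    endCol = k - first(lineNo) + numel(patLow);         % column where patLow ends
    above  = chr(first(lineNo-1) : nl(lineNo-1) - 1);   % text of the previous line
    lo     = endCol - numel(patUp) + 1;                 % column where patUp must start
    if lo >= 1 && endCol <= numel(above) && strcmp(above(lo:endCol), patUp)
        found = true;
        break
    end
end
```

Because the check indexes directly into the previous line at the required columns, it is not limited to the first '999' on that line. Whether this is actually faster than the line-by-line solutions would have to be measured on the real files.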

Related Solutions

MATLAB: ‘strcmp’ works when trying to count number of times a character occurs in a textfile but ‘findstr’ fails!

        s=strfind(oneread,character);%looking for word in each sentence

strfind() returns a list of indices in oneread where the pattern character starts. If no locations are found then strfind() returns empty.

        if isempty(s==0)%checking if s has any elements

s itself can be empty, but it will never contain any 0. In the context of your code, s==0 will be an array of false values the same size as s. Checking isempty() of that would be the same as checking isempty(s)
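
A small sketch of that equivalence, using made-up inputs:

```matlab
s = strfind('abc', 'z');       % no match, so s is empty
t = strfind('abcabc', 'b');    % two matches, t is [2 5]

a = isempty(s == 0);           % s == 0 is empty because s is empty
b = isempty(t == 0);           % t == 0 is [false false], which is not empty
c = isequal(a, isempty(s)) && isequal(b, isempty(t));   % same answers as isempty alone
```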

            c=c+1;

So you are counting the number of times that no matches were found in the line.

Perhaps you intended

        if isempty(s)==0 %checking if s has any elements

which would be the same as

        if ~isempty(s)

in which case you would at least be counting the number of lines that it was found on.

But you are supposed to be checking the count of matches, not the number of lines: if the pattern occurs more than once on the same line, then it should be counted more than once.
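
The difference between the two counts can be sketched as follows. The cell array txtLines is a hypothetical stand-in for the lines read from the file, and the variable names are not taken from the original code.

```matlab
% Hypothetical lines standing in for the text read from the file
txtLines  = {'a-b-c', 'no match here', '--'};
character = '-';

nLines   = 0;   % number of lines that contain at least one match
nMatches = 0;   % total number of matches over all lines
for k = 1:numel(txtLines)
    s = strfind(txtLines{k}, character);
    nLines   = nLines   + ~isempty(s);   % adds 1 only when something was found
    nMatches = nMatches + numel(s);      % adds one per occurrence
end
```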

If you are meant to count words, then you need to be careful. Suppose you are asked to count 'the' and the text is 'The theoretical theatre theologizes thews.' Then the proper answer is either zero ('The' is not 'the') or one (if case distinctions are to be ignored), but strfind would find four.
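
One way to count whole words only, offered here as a sketch rather than as what the assignment requires, is regexp with the word-boundary anchors \< and \>:

```matlab
txt = 'The theoretical theatre theologizes thews.';

nSub  = numel(strfind(txt, 'the'));                    % substring hits, including inside longer words
nWord = numel(regexp(txt, '\<the\>'));                 % whole-word hits, case-sensitive
nAny  = numel(regexp(txt, '\<the\>', 'ignorecase'));   % whole-word hits, ignoring case
```

Here nSub is 4, nWord is 0 and nAny is 1, matching the three counts discussed above.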

MATLAB: Storing data from 2 lines of text

Well, yes, you're reading the first three lines, of which the first two are mostly text only. So of course, it only works on the 3rd line.

You need to skip the first two lines before reading the next two. Two calls to fgets or fgetl (I prefer the latter) would take care of that.

Note that I wouldn't hardcode the position of the numbers in each line as that's quite fragile. I would just detect the position of the two '*' and extract the portion of text in between. str2num can then convert all the numbers in between in one go:

%...

fid = fopen(fullfile(path, file), 'rt'); %fullfile is better than strcat, 'rt' for text file
fgetl(fid); fgetl(fid); %skip first two lines
for lcount = 1:2
   tline = fgetl(fid);
   starpos = find(tline == '*');
   assert(numel(starpos) == 2, 'line %d does not have two *', lcount+2);
   numbers{lcount} = str2num(tline(starpos(1)+1 : starpos(2)-1));
end
fclose(fid);
%...
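
The extraction step can be tried on a single line without any file. In the sketch below the content of tline is hypothetical, since the real file layout is not shown here. It also swaps str2num for sscanf, which is an alternative rather than what the answer uses: str2num evaluates the text with eval, while sscanf only parses numbers.

```matlab
% Hypothetical line; the real file layout is not shown in the question
tline = 'Pressure * 1.5 2.25 -3 * kPa';

starpos = find(tline == '*');                     % positions of the two delimiters
assert(numel(starpos) == 2, 'expected exactly two * characters');
between = tline(starpos(1)+1 : starpos(2)-1);     % text strictly between the stars
numbers = sscanf(between, '%f').';                % parse all numbers into a row vector
```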
