MATLAB: Is regexp including extra data

MATLAB, regexp, split

I’m trying to use REGEXP to match the following flags (State, Post, RC, State, Junk), and then create a cell array of strings.

My inputs are:

1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B
2  25.287466  156.162447  21578.288  97.234234  Post  BBBBB  2  C11B
9  25.387466  156.362447  21578.388  97.334234  RC  CCCCC  3  C22B
99  25.387466  156.362447  21578.388  97.334234  State  DDDDD  4  C33B
999  25.387466  156.362447  21578.388  97.334234  Junk  EEEEE  5  C44B

I’m using the following MATLAB commands:

    data = regexp(LineTxt,'-?\d+(\.\d+)?','split');
    Flag = cellstr(data{1,6});

For unknown reasons I keep getting the following output:

    '  State  AAAAA'  '  Post  BBBBB'  '  RC  CCCCC'  '  State  DDDDD'  '  Junk  EEEEE'

Intended output is:

' State '  ' Post '  ' RC '  ' State '  '  Junk'

Why are the extra fields being included?

Best Answer

Neither do I understand why. Regular expressions are tricky. Another approach:

    % rows is assumed to be a cell array holding one line of text per cell
    data = cell( size(rows) );
    for k = 1:numel(rows)
        data{k} = regexp( rows{k}, 'State|Post|RC|Junk', 'match', 'once' );
    end
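A likely explanation for the extra field, sketched under the assumption that LineTxt holds one line of the posted input as a char vector: with 'split', regexp returns the text between matches, and the pattern in the question only matches numbers. No number sits between the flag and the next column, so both end up in the same piece. The variable names `pieces`, `fields` and `flag` are made up for this illustration.

```matlab
% One input line from the question (spacing as posted; the real file may differ).
LineTxt = '1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B';

% 'split' cuts the line only where the pattern matches, i.e. at the numbers.
pieces = regexp(LineTxt, '-?\d+(\.\d+)?', 'split');

% No number separates 'State' from 'AAAAA', so the 6th piece contains both words.
disp(['[' pieces{6} ']']);

% Splitting on whitespace instead turns every column into its own piece.
fields = regexp(strtrim(LineTxt), '\s+', 'split');
flag   = fields{6};
disp(flag);
```

Splitting on whitespace is only one possible fix; it assumes that no column contains embedded blanks.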

[Edit] IMO, given that the file is written with a similar format string, textscan is the best way to read the file.

    fid = fopen( 'cssm.txt' );
    cac = textscan( fid, '%*u%*f%*f%*f%*f%s%*s%*d%*s' );
    fclose( fid );
    cac{:}

returns

    ans = 
        'State'
        'Post'
        'RC'
        'State'
        'Junk'

Answer to comment:

    str = fileread( 'cssm.txt' );
    data = regexp( str, '((State)|(Post)|(RC)|(State)|(Junk))', 'match' )

returns

    data = 
        'State'    'Post'    'RC'    'State'    'Junk'

where cssm.txt contains your five lines of data.

However, the smarter the solution, the harder it is to make the code robust (and flexible). How should the code behave if the file contains rows that do not adhere to the "format" that I infer from your example? And in a few days, you might find that you need the third column.
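One way to make that decision explicit is to check each row before trusting it. The sketch below assumes that a valid row has exactly nine whitespace-separated columns with a known flag in column six; the second sample row, the list `known` and the choice to mark bad rows with an empty string are all illustrative assumptions, not part of the original file.

```matlab
% First row is from the question; the second is a made-up malformed row.
rows = { '1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B', ...
         'some header or otherwise malformed text' };

known = {'State', 'Post', 'RC', 'Junk'};   % flags considered valid (assumption)
flags = cell(size(rows));

for k = 1:numel(rows)
    fields = regexp(strtrim(rows{k}), '\s+', 'split');
    if numel(fields) == 9 && any(strcmp(fields{6}, known))
        flags{k} = fields{6};              % row fits the assumed layout
    else
        flags{k} = '';                     % mark rows that do not fit
    end
end
```

Keeping all the split fields around also makes it cheap to pick up another column later, for instance `fields{3}`.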

Related Solutions

MATLAB: Can rectangularized data be achieved by modifying a text file with MATLAB

Hi Brad. Instead of

   InputText=textscan(fid,'%s',1,'delimiter','\n');

Try this out

   [InputText, dn1]=fscanf(fid,'%c', [1 50]);

If the length of your file is as written in the original question, your header (1-by-50) should look like this:

   InputText =
   NFL58  23Mar2012  Show  2  1  01  0000000001  Low

I honestly didn't understand how large your header is. Anyhow, "try the simplest solutions first": it is much simpler to write code that reads the text from the file as a char matrix and the numbers as a purely numeric matrix (2-D array).

Also try the fscanf function (which returns its output as a matrix) instead of textscan (which returns a cell array). Matrices are much easier to handle than cell arrays.

However, if you try something like:

   [data,nd2]=fscanf(fid,'%g', [1 inf]);

which reads numeric data from the file, starting from where the previous read stopped, you'll get this:

   data=
   1.0e+004 *
       0.0001    0.0025    0.0156    2.1578    0.0097

This means the read stopped when it encountered a char (it is 'stops'), because a char is not numeric (the floating-point type declared with %g). So you need to isolate the header from the rest, or at least separate the chars from the numbers in the first few lines.
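A minimal sketch of that separation, assuming a hypothetical layout of one text header line followed by purely numeric rows with three columns. The temporary file and its numeric content are invented for the example; only the header text is taken from above.

```matlab
% Build a small sample file: one header line, then two numeric rows (made-up data).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'NFL58  23Mar2012  Show  2  1  01  0000000001  Low\n');
fprintf(fid, '%g %g %g\n', [1 2 3; 4 5 6].');
fclose(fid);

fid = fopen(fname, 'r');
header = fgetl(fid);                    % text part, returned as a char row vector
nums   = fscanf(fid, '%g', [3 Inf]).';  % numeric part; fscanf fills column-wise, hence the transpose
fclose(fid);
delete(fname);
```

Reading the header with fgetl avoids hard-coding its length, at the cost of assuming the header occupies exactly one line.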

I think you get the idea and can finish the rest.

Regards.

MATLAB: How to read strings from file with fscanf or sscanf (NOT textscan)

Just a few alternate thoughts (and I'll think about FSCANF over the weekend a little more).

=== Using REGEXP (available in almost all languages):

.. and the following content (to illustrate the flexibility):

 1 A ABC
 2 B ABC
 3 C ABC DEF
 4 D ABC
 5 E ABC FGH
 6 F ABC
 7 G ABC
 8 H ABC
 9 I ABC
 10 J ABC

Code:

 >> buffer = fileread('data.txt') ;    % Could be performed with FOPEN/FREAD 
                                       % to be more generic.
 >> pattern = '(?<Column1>\d+)\s(?<Column2>\w+)\s+(?<Column3>.*?)[\r\n]' ;
 >> n = regexp(buffer, pattern, 'names')
 n = 
 1x10 struct array with fields:
    Column1
    Column2
    Column3
 >> n(2)
 ans = 
    Column1: '2'
    Column2: 'B'
    Column3: 'ABC'
 >> n(3)
 ans = 
    Column1: '3'
    Column2: 'C'
    Column3: 'ABC DEF'
 >> str2double({n(:).Column1})
 ans =
     1     2     3     4     5     6     7     8     9    10

etc. Here I used named tokens and a struct array output, just for the fun of it. I don't think that it is what you are looking for, but I wanted to illustrate a regexp-based approach for the record.

=== Reading array of chars and converting to cell array based on position of spaces and \n and/or \r:

... to update if asked by OP.

=== Using FSCANF:

.. and the following, more regular content:

 1 A ABC
 2 B ABC
 3 C ABC
 4 D ABC
 5 E ABC
 6 F ABC
 7 G ABC
 8 H ABC
 9 I ABC
 10 J ABC

Code:

 fid  = fopen('data_regular.txt', 'r') ;
 data = cell(1e6, 3) ;                    % Prealloc.
 rCnt = 0 ;                               % Row counter.
 while ~feof(fid)
    rCnt = rCnt + 1 ;
    data{rCnt,1} = fscanf(fid, '%d', 1) ;
    data{rCnt,2} = fscanf(fid, '%s', 1) ;
    data{rCnt,3} = fscanf(fid, '%s', 1) ;
 end
 fclose(fid) ;
 data = data(1:rCnt,:) ;                  % Truncate.

Using this, we get:

 >> data
 data = 
    [ 1]    'A'    'ABC'
    [ 2]    'B'    'ABC'
    [ 3]    'C'    'ABC'
    [ 4]    'D'    'ABC'
    [ 5]    'E'    'ABC'
    [ 6]    'F'    'ABC'
    [ 7]    'G'    'ABC'
    [ 8]    'H'    'ABC'
    [ 9]    'I'    'ABC'
    [10]    'J'    'ABC'

Note that EOF should be tested more carefully: the loop above checks it only once per three FSCANF calls, which assumes a well-formed file. Alternatively, the whole loop could be wrapped in a TRY/CATCH statement.
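As a rough sketch of a more defensive loop, one option is to test the result of each read instead of relying on feof alone. The temporary file, the shortened sample content and the variable names `n`, `a` and `b` are assumptions made for this illustration.

```matlab
% Write a shortened version of the regular sample to a temporary file.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '1 A ABC\n2 B ABC\n3 C ABC\n');
fclose(fid);

fid  = fopen(fname, 'r');
data = cell(0, 3);
while true
    n = fscanf(fid, '%d', 1);       % empty at EOF or if the next token is not a number
    if isempty(n), break; end       % also covers a trailing newline at the end of the file
    a = fscanf(fid, '%s', 1);
    b = fscanf(fid, '%s', 1);
    if isempty(a) || isempty(b)     % incomplete last row: stop without storing it
        break;
    end
    data(end+1, :) = {n, a, b};     % grows the cell array; preallocate for large files
end
fclose(fid);
delete(fname);
```

This version silently stops at the first row it cannot parse; whether to stop, skip or raise an error there is a design choice that depends on the data.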

=== Using FGETL + SSCANF:

It is more complicated than FSCANF, because the latter advances an internal file pointer as it reads, so the next read operation picks up what follows. SSCANF doesn't work like this: you have to indicate in the format what to extract and what to skip. To illustrate:

 >> s = '12 A ABC' ;
 >> sscanf(s, '%d')                 % OK for the number.
 ans =
     12
 >> sscanf(s, '%s')                 % Can we do the same for the 2nd col? KO.
 ans =
 12AABC
 >> sscanf(s, '%*d %s', 1)          % Skip # and read a 1 char string => KO, ASCII.
 ans =
 65
 >> char(sscanf(s, '%*d %s', 1))    % => char, OK.
 ans =
 A
 >> char(sscanf(s, '%*d %s %*s'))   % Or read a string and skip next.
 ans =
 A
 >> char(sscanf(s, '%*d %*s %s'))   % Same for 3rd column, but dim KO.
 ans =
 A
 B
 C
 >> char(sscanf(s, '%*d %*s %s')).' % Transpose, OK.
 ans =
 ABC
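Putting the pieces together, a line-by-line reader based on FGETL + SSCANF might look like the sketch below. It reuses the formats from the session above; the temporary file, the shortened sample content and the use of reshape (instead of a transpose) to force row vectors are assumptions of this example.

```matlab
% Write a shortened sample to a temporary file.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '1 A ABC\n2 B ABC\n10 J ABC\n');
fclose(fid);

fid  = fopen(fname, 'r');
data = cell(0, 3);
txt  = fgetl(fid);                       % fgetl returns -1 (not a char) at EOF
while ischar(txt)
    col1 = sscanf(txt, '%d', 1);                              % leading number
    col2 = reshape(char(sscanf(txt, '%*d %s %*s')), 1, []);   % skip number, keep 2nd column
    col3 = reshape(char(sscanf(txt, '%*d %*s %s')), 1, []);   % skip two columns, keep 3rd
    data(end+1, :) = {col1, col2, col3};
    txt = fgetl(fid);
end
fclose(fid);
delete(fname);
```

Each line is parsed three times here, which keeps the formats simple but is not the fastest option for large files.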
