MATLAB: Can REGEXP map values from different parts of a text file


I have a text file with the following contents:

MSNout_BER (0:31) Observation #100 Rx'd at:  (58568.000) Msg. Time: (58568.000)
    Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rel Mode: Active
MSNout_SSS (0:32) Observation #101 Rx'd at:  (58569.000) Msg. Time: (58569.000)
    Forward to IRU: true   Rcv Date: 2010121   Synch: a0a0   Bel Mode: High
Type: 12    Malck ID: 12345 Time Tag: 58548.12345678
Hand ID: 0  SV ID:   51 Spam ID: 0  BOZ/FAS: 0  Realt Flag: 0
MSNout_BER (0:33) Observation #102 Rx'd at:  (58570.000) Msg. Time: (58570.000)
    Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rel Mode: Active
MSNout_SSS (0:34) Observation #103 Rx'd at:  (58571.000) Msg. Time: (58571.000)
    Forward to IRU: true   Rcv Date: 2010121   Synch: a0a0   Bel Mode: High
Type: 1 Malck ID: 12345 Time Tag: 58549.12345678
Hand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58550.12345678
Hand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58551.12345678
Hand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58552.12345678
Hand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58553.12345678
Hand ID: 1  SV ID:   1  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58554.12345678
Hand ID: 1  SV ID:   1  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58555.12345678
Hand ID: 1  SV ID:   1  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58556.12345678
Hand ID: 1  SV ID:   3  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0

I'm using the following commands to retrieve the Time Tag and SV ID values (SV ID values 1 and 2 only; all others are ignored):

[fn,pn] = uigetfile('*.txt', 'Select Text File');
OAMfilename = fullfile(pn, fn);
buffer  = fileread(OAMfilename);
pattern = '*?Tag:\s+([\d\.]+).*?SV ID:\s+([12])\W';
tokens = regexp(buffer, pattern, 'tokens');
data = reshape(str2double([tokens{:}]), 2, []).';

Results:

58548.1234567800  2
58550.1234567800  2
58551.1234567800  2
58552.1234567800  2
58553.1234567800  1
58554.1234567800  1
58555.1234567800  1

Initially, I thought the results were as expected. Then I noticed the time tag for the first occurrence of SV ID equal to 2 was wrong – 58549.12345678 is the proper time tag.

Is it possible to force MATLAB to recognize each Time Tag value that occurs just prior to each SV ID value? Could a Lookaround operator be used in this case?

Best Answer

This seems to work.

    buf = fileread( 'cssm.txt' );
    rex = '(?<=Time Tag: )([\d\.]+).+?(?<=SV ID:[ ]+)(\d+)';
    cac = regexp( buf, rex, 'tokens' );
    cac{:}

returns

    ans = 
        '58548.12345678'    '51'
    ans = 
        '58549.12345678'    '2'
    ans = 
        '58550.12345678'    '2'
    ans = 
        '58551.12345678'    '2'
    ans = 
        '58552.12345678'    '2'
    ans = 
        '58553.12345678'    '1'
    ans = 
        '58554.12345678'    '1'
    ans = 
        '58555.12345678'    '1'
    ans = 
        '58556.12345678'    '3'

where cssm.txt contains your data

Comments on the regular expression:

The parentheses capture tokens.
Each token is the group of digits that follows an identifier and the space after it.
The identifier and the space are matched by lookbehind operators, so they must be present but are not part of the token.
This gives two groups of the form (?<=name)(value).
Between these two groups is .+?, which uses a lazy quantifier: it matches one or more characters, but only as many as are needed for the rest of the expression to match.
The regular expression must match one contiguous substring, so something has to match the characters between the two groups. Here that is done by .+?.

Most of the terms above are copied from the online help.
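
Since the question only wants SV ID values 1 and 2, one option is to capture every pair with the pattern above and filter afterwards. The following is a minimal sketch, assuming the records look like the ones in the question; the variable names rows, data and keep are illustrative and not part of the original answer.

```matlab
% Four records copied from the question's file (SV IDs 51, 2, 1 and 3).
rows = { ...
    'Type: 12    Malck ID: 12345 Time Tag: 58548.12345678', ...
    'Hand ID: 0  SV ID:   51 Spam ID: 0  BOZ/FAS: 0  Realt Flag: 0', ...
    'Type: 1 Malck ID: 12345 Time Tag: 58549.12345678', ...
    'Hand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0', ...
    'Type: 1 Malck ID: 12345 Time Tag: 58553.12345678', ...
    'Hand ID: 1  SV ID:   1  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0', ...
    'Type: 1 Malck ID: 12345 Time Tag: 58556.12345678', ...
    'Hand ID: 1  SV ID:   3  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0'};
buf = strjoin(rows, char(10));      % char(10) is the newline character

% Pattern from the answer: capture every (time tag, SV ID) pair.
rex = '(?<=Time Tag: )([\d\.]+).+?(?<=SV ID:[ ]+)(\d+)';
cac = regexp(buf, rex, 'tokens');

% Convert the tokens to an N-by-2 numeric array: [time tag, SV ID].
data = reshape(str2double([cac{:}]), 2, []).';

% Keep only the rows whose SV ID is 1 or 2.
keep = ismember(data(:, 2), [1 2]);
data = data(keep, :);
```

Filtering after the match keeps each time tag paired with the SV ID of its own record, whatever that SV ID is.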

BTW: Your pattern works - after a little fixing:

    rex = '*?Tag:\s+([\d\.]+).*?SV ID:\s+([125]{1,2})\W';

but what is the purpose of the leading *? and the trailing \W ?
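
To see why widening the second token changes the pairing, the sketch below compares the restricted class ([12]) with a capture of any digits on three records from the question. The leading *? is dropped here, which is an assumption made for illustration, and the variable names are hypothetical.

```matlab
% Three records in the question's layout; the first one has SV ID 51.
rows = { ...
    'Type: 12    Malck ID: 12345 Time Tag: 58548.12345678', ...
    'Hand ID: 0  SV ID:   51 Spam ID: 0  BOZ/FAS: 0  Realt Flag: 0', ...
    'Type: 1 Malck ID: 12345 Time Tag: 58549.12345678', ...
    'Hand ID: 1  SV ID:   2  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0', ...
    'Type: 1 Malck ID: 12345 Time Tag: 58553.12345678', ...
    'Hand ID: 1  SV ID:   1  Spam ID: 0  BOZ/FAS: 1  Realt Flag: 0'};
buf = strjoin(rows, char(10));

% Restricting the SV ID inside the pattern: when SV ID is 51, the lazy .*?
% keeps expanding into the next record until it finds an SV ID of 1 or 2.
narrow = regexp(buf, 'Tag:\s+([\d\.]+).*?SV ID:\s+([12])\W', 'tokens');

% Capturing any SV ID lets every match end inside its own record.
wide = regexp(buf, 'Tag:\s+([\d\.]+).*?SV ID:\s+(\d+)', 'tokens');

disp(narrow{1})   % time tag 58548.12345678 paired with SV ID 2 (wrong record)
disp(wide{2})     % time tag 58549.12345678 paired with SV ID 2
```

The restricted pattern returns two pairs and loses time tag 58549.12345678, while the wider one returns all three pairs. Note that [125]{1,2} only helps for SV IDs built from the digits 1, 2 and 5; a (\d+) token does not have that limitation.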

A bit more robust:

    rex = '(?<=Time Tag:)[ ]+([\d\.]+)[^\n]+?(?<=SV ID:)[ ]+(\d+)';

Replacing \s+ between name and value by [ ]+ excludes new-line, tab, etc.
Replacing .*? between the two name-value pairs by [^\n]+? keeps the match from running across a line break.
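
The difference between \s+ and [ ]+ can be checked on a small string in which the value has slipped onto the next line. This is only a sketch; the sample strings are made up for illustration.

```matlab
% A value on the same line, and a value pushed onto the next line.
sameLine = 'Time Tag: 58548.5';
nextLine = sprintf('Time Tag:\n 58548.5');

% \s+ matches any white space, including the newline.
tokS1 = regexp(sameLine, 'Time Tag:\s+([\d\.]+)', 'tokens');
tokS2 = regexp(nextLine, 'Time Tag:\s+([\d\.]+)', 'tokens');

% [ ]+ matches space characters only, so the newline blocks the match.
tokB1 = regexp(sameLine, 'Time Tag:[ ]+([\d\.]+)', 'tokens');
tokB2 = regexp(nextLine, 'Time Tag:[ ]+([\d\.]+)', 'tokens');

fprintf('\\s+  : %d and %d match(es)\n', numel(tokS1), numel(tokS2));
fprintf('[ ]+ : %d and %d match(es)\n', numel(tokB1), numel(tokB2));
```

With [ ]+ a name whose value is not on the same line is simply skipped instead of being paired with whatever number comes next.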

Related Solutions

MATLAB: How, if possible, do I limit the number of times REGEXP searches for a specific pattern

I'll answer assuming that my last comment under your question is correct. It is nice to implement complex regular expressions for learning, but in practice one often gets better results by splitting a complex one-shot call/pattern into a series of simpler calls/patterns. Here is an example. I am using the following content:

 MSN_BER (0:31) Observation #1 Rx'd at:  (58570.000) Msg. Time:  (58568.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  State Time:            12:00:00.000   (58571.000)
  State Position:       -1500.0000, -5000.0000, 4100.0000
 MSN_RAM (0:32) Observation #20 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 1
 Type: 1 Track ID: 12345 Time Tag: 58573.00000000
   Band ID: 1   AD ID:   21 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 Type: 1 Track ID: 12345 Time Tag: 58574.00000000
   Band ID: 1   AD ID:   21 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 MSN_RAM (0:32) Observation #30 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 2
 Type: 1 Track ID: 12345 Time Tag: 58583.00000000
   Band ID: 1   AD ID:   31 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 Type: 1 Track ID: 12345 Time Tag: 58585.00000000
   Band ID: 1   AD ID:   32 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 MSN_BER (0:31) Observation #1 Rx'd at:  (58570.000) Msg. Time:  (58568.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  State Time:            12:00:00.000   (58571.000)
  State Position:       -1500.0000, -5000.0000, 4100.0000
 MSN_RAM (0:32) Observation #20 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 1
 Type: 1 Track ID: 12345 Time Tag: 58578.00000000
   Band ID: 1   AD ID:   41 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 Type: 1 Track ID: 12345 Time Tag: 58579.00000000
   Band ID: 1   AD ID:   41 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0

which is made of two MSN_BER/MSN_RAM blocks framing an MSN_RAM-only block. I assume that you want to get all AD IDs and time tags of the MSN_BER/MSN_RAM blocks.

The first step is to read the file and get valid MSN_BER/MSN_RAM blocks:

 content = fileread( 'bradFile.txt' ) ;
 BER_blocks = regexp( content, 'MSN_BER.+?RAM(?:[^R]+|R(?!AM))*', 'match' ) ;

Running this produces:

 >> BER_blocks
 BER_blocks = 
    [1x766 char]    [1x762 char]

If you display these two blocks, you'll see that the first doesn't include the MSN_RAM-only block that follows it. The first part of the pattern is trivial; the second part, (?:[^R]+|R(?!AM))*, matches any run of characters other than 'R', or an 'R' that is not followed by 'AM'. This is one (not too inefficient) way to exclude a given string from the match.
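
The exclusion idiom is easier to follow on a short string. The sketch below applies the same pattern to a made-up one-line sample with the same block order (BER/RAM, RAM only, BER/RAM); the sample text and variable names are assumptions for illustration.

```matlab
% Made-up sample: a BER/RAM block, a RAM-only block, then a BER/RAM block.
txt = 'MSN_BER a Rx MSN_RAM b1 Rcv MSN_RAM c MSN_BER d MSN_RAM e';

% After the first 'RAM', consume runs of non-'R' characters or a single 'R'
% that is not followed by 'AM'; the match therefore stops at the next 'RAM'.
blocks = regexp(txt, 'MSN_BER.+?RAM(?:[^R]+|R(?!AM))*', 'match');

for k = 1 : numel(blocks)
    fprintf('block %d: "%s"\n', k, blocks{k});
end
```

On this sample the first match ends with the 'MSN_' prefix of the following RAM-only header, and the text of the RAM-only block itself (' c ') is in neither match. The 'R' in 'Rx' and 'Rcv' does not stop the match because it is not followed by 'AM'.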

The second step is to extract AD IDs and time tags from each block.

 data = cell( size( BER_blocks )) ;
 for bId = 1 : numel( BER_blocks )
    tokens = regexp( BER_blocks{bId}, 'Time Tag:\s*([\d\.]+).+?AD ID:\s*(\d+)', ...
                     'tokens' ) ;
    data{bId} = reshape( str2double( [tokens{:}] ), 2, [] ).' ;
 end

Based on the above content, this leads to the following data cell array (each cell contains the time tags and AD IDs of one MSN_BER/MSN_RAM block):

 >> celldisp( data )
 data{1} =
        58573          21
        58574          21
 data{2} =
        58578          41
        58579          41

You can then concatenate these cells' content if you want to have one big array instead of one array per block:

 >> data = vertcat( data{:} )
 data =
       58573          21
       58574          21
       58578          41
       58579          41

Let me know if it's not what you wanted.

MATLAB: Can REGEXP or TEXTSCAN be used to split 2 distinct data sets from a single text file

Try this:

    str = fileread('your_file.txt');
    ca1 = regexp( str, 'MSN_JET.+?(?=(MSN_SENSUM)|($))', 'match' );
    ca2 = regexp( str, 'MSN_SENSUM.+?(?=(MSN_JET)|($))', 'match' );

What remains is to write the two files. This process does not remove any new-line characters.
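
The writing step could look like the sketch below. It uses a small made-up sample instead of a real file, and the output file names (jet_part.txt, sensum_part.txt) and the choice of tempdir are assumptions for illustration only.

```matlab
% Made-up sample with alternating MSN_JET and MSN_SENSUM sections.
str = sprintf('MSN_JET a\nMSN_SENSUM b\nMSN_JET c\nMSN_SENSUM d');

ca1 = regexp(str, 'MSN_JET.+?(?=(MSN_SENSUM)|($))', 'match');
ca2 = regexp(str, 'MSN_SENSUM.+?(?=(MSN_JET)|($))', 'match');

% Hypothetical output files, one per data set.
outFiles = {fullfile(tempdir, 'jet_part.txt'), ...
            fullfile(tempdir, 'sensum_part.txt')};
parts    = {[ca1{:}], [ca2{:}]};    % concatenate the sections of each set

for k = 1 : numel(outFiles)
    fid = fopen(outFiles{k}, 'w');
    assert(fid ~= -1, 'Could not open %s for writing.', outFiles{k});
    fprintf(fid, '%s', parts{k});   % sections keep their own line breaks
    fclose(fid);
end
```

Because the matched sections still contain their original line breaks, they are written with '%s' and no extra newline is added.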
