MATLAB: Can REGEXP map values from different parts of a text file

lookaround, MATLAB, string match, string search

I have a text file with the following contents:
MSNout_BER (0:31) Observation #100 Rx'd at: (58568.000) Msg. Time: (58568.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rel Mode: Active
MSNout_SSS (0:32) Observation #101 Rx'd at: (58569.000) Msg. Time: (58569.000)
Forward to IRU: true Rcv Date: 2010121 Synch: a0a0 Bel Mode: High
Type: 12 Malck ID: 12345 Time Tag: 58548.12345678
Hand ID: 0 SV ID: 51 Spam ID: 0 BOZ/FAS: 0 Realt Flag: 0
MSNout_BER (0:33) Observation #102 Rx'd at: (58570.000) Msg. Time: (58570.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rel Mode: Active
MSNout_SSS (0:34) Observation #103 Rx'd at: (58571.000) Msg. Time: (58571.000)
Forward to IRU: true Rcv Date: 2010121 Synch: a0a0 Bel Mode: High
Type: 1 Malck ID: 12345 Time Tag: 58549.12345678
Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58550.12345678
Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58551.12345678
Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58552.12345678
Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58553.12345678
Hand ID: 1 SV ID: 1 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58554.12345678
Hand ID: 1 SV ID: 1 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58555.12345678
Hand ID: 1 SV ID: 1 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
Type: 1 Malck ID: 12345 Time Tag: 58556.12345678
Hand ID: 1 SV ID: 3 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0
I’m using the following commands to retrieve the values for Time Tag: and SV ID: (only SV ID values 1 and 2; all others are ignored):
[fn,pn] = uigetfile('*.txt', 'Select Text File');
OAMfilename = fullfile(pn, fn);
buffer = fileread(OAMfilename);
pattern = '*?Tag:\s+([\d\.]+).*?SV ID:\s+([12])\W';
tokens = regexp(buffer, pattern, 'tokens');
data = reshape(str2double([tokens{:}]), 2, []).';
Results:
58548.1234567800 2
58550.1234567800 2
58551.1234567800 2
58552.1234567800 2
58553.1234567800 1
58554.1234567800 1
58555.1234567800 1
Initially, I thought the results were as expected. Then I noticed the time tag for the first occurrence of SV ID equal to 2 was wrong – 58549.12345678 is the proper time tag.
Is it possible to force MATLAB to recognize each Time Tag value that occurs just prior to each SV ID value? Could a Lookaround operator be used in this case?

Best Answer

This seems to work.
buf = fileread( 'cssm.txt' );
rex = '(?<=Time Tag: )([\d\.]+).+?(?<=SV ID:[ ]+)(\d+)';
cac = regexp( buf, rex, 'tokens' );
cac{:}
returns
ans =
'58548.12345678' '51'
ans =
'58549.12345678' '2'
ans =
'58550.12345678' '2'
ans =
'58551.12345678' '2'
ans =
'58552.12345678' '2'
ans =
'58553.12345678' '1'
ans =
'58554.12345678' '1'
ans =
'58555.12345678' '1'
ans =
'58556.12345678' '3'
where cssm.txt contains your data
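The question asked for a numeric array holding only the rows where SV ID is 1 or 2. A minimal sketch of that last step follows, assuming the tokens come back in the shape shown above. The sample text is a four-line excerpt of the data embedded in the code so the snippet runs without cssm.txt, and the names data and keep are illustrative only.

```matlab
% Four lines of the sample data, embedded so the snippet is self-contained.
buf = sprintf(['Type: 12 Malck ID: 12345 Time Tag: 58548.12345678\n' ...
               'Hand ID: 0 SV ID: 51 Spam ID: 0 BOZ/FAS: 0 Realt Flag: 0\n' ...
               'Type: 1 Malck ID: 12345 Time Tag: 58549.12345678\n' ...
               'Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0\n']);

% Same expression as in the answer above.
rex = '(?<=Time Tag: )([\d\.]+).+?(?<=SV ID:[ ]+)(\d+)';
cac = regexp(buf, rex, 'tokens');

% Flatten the nested token cells and convert to an N-by-2 numeric array:
% column 1 is the Time Tag, column 2 is the SV ID.
data = reshape(str2double([cac{:}]), 2, []).';

% Keep only the rows whose SV ID is 1 or 2 (filtering after the match,
% so every Time Tag stays paired with its own SV ID).
keep = ismember(data(:,2), [1 2]);
data = data(keep, :);
```

Filtering after the match, rather than inside the regular expression, means an unwanted SV ID (51 here) is still consumed together with its own Time Tag and cannot be paired with a later one.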
Comments on the regular expression:
  • The expression captures tokens.
  • Each token is the group of digits that follows an identifier and a space.
  • The "identifier and space" parts are used as expressions in look-behind operators.
  • This gives two groups of the form (?<= name)( value).
  • Between these two groups is .+?, which is a lazy quantifier. It matches one or more characters, but only as many as are necessary for the rest of the expression to match.
  • The regular expression must match a single substring, so something is needed to match the characters between the two groups. In this case that is done by .+?.
Most of the italic words are copied and pasted from the online help.
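To illustrate the lazy quantifier described in the comments above, here is a small sketch on a made-up string (str and the a=/b= labels are hypothetical, not part of the original data). It compares .+? with the greedy .+ in otherwise identical expressions.

```matlab
% Made-up input with two name-value pairs of each kind.
str = 'a=1 b=2 a=3 b=4';

% Lazy: .+? stops at the first b= that lets the match succeed,
% so each a= is paired with the nearest following b=.
lazy = regexp(str, 'a=(\d).+?b=(\d)', 'tokens');

% Greedy: .+ runs as far as possible, so the first a= is paired
% with the last b= and only one match is found.
greedy = regexp(str, 'a=(\d).+b=(\d)', 'tokens');

disp(numel(lazy))      % 2 matches: {'1','2'} and {'3','4'}
disp(numel(greedy))    % 1 match:   {'1','4'}
```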
BTW: Your pattern works - after a little fixing:
rex = '*?Tag:\s+([\d\.]+).*?SV ID:\s+([125]{1,2})\W';
But what is the purpose of the leading *? and the trailing \W?
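The sketch below is one way to see why the character class matters. It uses a four-line excerpt of the data and drops the leading *? from both patterns (an assumption made here to keep the comparison simple; old, fixed, tokOld and tokFixed are illustrative names). With [12], the SV ID of 51 cannot match, so the lazy .*? keeps going until the next SV ID and pairs the first Time Tag with the wrong value.

```matlab
buf = sprintf(['Type: 12 Malck ID: 12345 Time Tag: 58548.12345678\n' ...
               'Hand ID: 0 SV ID: 51 Spam ID: 0 BOZ/FAS: 0 Realt Flag: 0\n' ...
               'Type: 1 Malck ID: 12345 Time Tag: 58549.12345678\n' ...
               'Hand ID: 1 SV ID: 2 Spam ID: 0 BOZ/FAS: 1 Realt Flag: 0\n']);

% Original character class: only a single 1 or 2 is accepted as SV ID.
old = 'Tag:\s+([\d\.]+).*?SV ID:\s+([12])\W';
tokOld = regexp(buf, old, 'tokens');

% Fixed character class: 51 is accepted too, so it is consumed
% together with its own Time Tag.
fixed = 'Tag:\s+([\d\.]+).*?SV ID:\s+([125]{1,2})\W';
tokFixed = regexp(buf, fixed, 'tokens');

disp(tokOld{1})        % '58548.12345678'  '2'   (mis-paired)
disp(tokFixed{1})      % '58548.12345678'  '51'
disp(tokFixed{2})      % '58549.12345678'  '2'
```

With the fixed pattern, the unwanted SV ID values still appear in the tokens and have to be filtered out afterwards.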
A bit more robust:
rex = '(?<=Time Tag:)[ ]+([\d\.]+)[^\n]+?(?<=SV ID:)[ ]+(\d+)';
  • Replacing \s+ between name and value with [ ]+ excludes newlines, tabs, etc.
  • Replacing .*? between the two name-value-pairs by |[^