MATLAB: How, if possible, do I limit the number of times REGEXP searches for a specific pattern

MATLAB, regexp

I’m using a regular expression to search blocks of text that look like the following:

MSN_BER (0:31) Observation #1 Rx'd at:  (58570.000) Msg. Time:  (58568.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  State Time:            12:00:00.000   (58571.000)
  State Position:       -1500.0000, -5000.0000, 4100.0000
MSN_RAM (0:32) Observation #20 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 1
Type: 1 Track ID: 12345 Time Tag: 58573.00000000
   Band ID: 1   AD ID:   21 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
MSN_RAM (0:32) Observation #30 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 2
Type: 1 Track ID: 12345 Time Tag: 58583.00000000
   Band ID: 1   AD ID:   31 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
Type: 1 Track ID: 12345 Time Tag: 58585.00000000
   Band ID: 1   AD ID:   32 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0

Note: There is no 2nd MSN_BER data block.

I’m using the following search pattern and REGEXP function to extract the time tag and AD ID values:

exp = '([\d\.]+)\s+Band[^A]+?AD ID:\s+(\d+).';
tokens3 = regexp(bufferSplit{BlockId}, exp, 'tokens');

This results in: tokens3 = {1×2 cell} {1×2 cell} {1×2 cell},

where the time tag and AD ID are contained in the cells for each occurrence in the block of text.

>> tokens3{1,1}

ans = '58573.00000000' '21'

>> tokens3{1,2}

ans = '58583.00000000' '31'

>> tokens3{1,3}

ans = '58585.00000000' '32'

What I’m trying to do is restrict the search. Specifically, I want to limit which time tag and AD ID values are returned, based on the fact that there is no 2nd MSN_BER data block. I know the 'once' option returns only the first match found, but a block can contain multiple occurrences of the AD ID and its associated time tag, so 'once' is not sufficient.

The result of this would be: tokens3 = {1×2 cell}

>> tokens3{1,1}

ans = '58573.00000000' '21'

Can this be accomplished using the REGEXP function?

Best Answer

I'll answer assuming that my last comment under your question is correct. It is nice to implement complex regular expressions for learning, but in practice one often gets better results by splitting a one-shot complex call/pattern into a series of simpler calls/patterns. Here is an example, using the following content:

 MSN_BER (0:31) Observation #1 Rx'd at:  (58570.000) Msg. Time:  (58568.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  State Time:            12:00:00.000   (58571.000)
  State Position:       -1500.0000, -5000.0000, 4100.0000
 MSN_RAM (0:32) Observation #20 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 1
 Type: 1 Track ID: 12345 Time Tag: 58573.00000000
   Band ID: 1   AD ID:   21 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 Type: 1 Track ID: 12345 Time Tag: 58574.00000000
   Band ID: 1   AD ID:   21 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 MSN_RAM (0:32) Observation #30 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 2
 Type: 1 Track ID: 12345 Time Tag: 58583.00000000
   Band ID: 1   AD ID:   31 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 Type: 1 Track ID: 12345 Time Tag: 58585.00000000
   Band ID: 1   AD ID:   32 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 MSN_BER (0:31) Observation #1 Rx'd at:  (58570.000) Msg. Time:  (58568.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  State Time:            12:00:00.000   (58571.000)
  State Position:       -1500.0000, -5000.0000, 4100.0000
 MSN_RAM (0:32) Observation #20 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 1
 Type: 1 Track ID: 12345 Time Tag: 58578.00000000
   Band ID: 1   AD ID:   41 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 Type: 1 Track ID: 12345 Time Tag: 58579.00000000
   Band ID: 1   AD ID:   41 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0

which is made of two MSN_BER/MSN_RAM blocks framing an MSN_RAM only block. I assume that you want to get all AD IDs and time tags of MSN_BER/MSN_RAM blocks.

The first step is to read the file and get valid MSN_BER/MSN_RAM blocks:

 content = fileread( 'bradFile.txt' ) ;
 BER_blocks = regexp( content, 'MSN_BER.+?RAM(?:[^R]+|R(?!AM))*', 'match' ) ;

Running this produces:

 >> BER_blocks
 BER_blocks = 
    [1x766 char]    [1x762 char]

If you display these two blocks, you'll see that the first doesn't include the MSN_RAM-only block. The first part of the pattern, 'MSN_BER.+?RAM', is trivial: it lazily matches from 'MSN_BER' up to the first 'RAM'. The second part, '(?:[^R]+|R(?!AM))*', matches all characters that are not 'R', plus every 'R' that is not followed by 'AM', so the match stops just before the next 'RAM'. This is one (not too inefficient) way to exclude a given string from the match.
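To see the exclusion idiom in isolation, here is a minimal sketch on a made-up one-line string (the variable names txt and blocks and the sample text are illustrative assumptions, not part of the original file). Note that the match ends with the 'MSN_' that precedes the excluded 'RAM', because only 'RAM' itself is forbidden.

```matlab
% Hypothetical sample: one BER/RAM pair followed by a standalone RAM block.
txt = 'MSN_BER a MSN_RAM b Rx MSN_RAM c' ;

% 'MSN_BER.+?RAM' lazily reaches the first 'RAM'. The group that follows
% consumes runs of non-'R' characters, or a single 'R' not followed by 'AM',
% so the match cannot run into a second 'RAM'.
blocks = regexp( txt, 'MSN_BER.+?RAM(?:[^R]+|R(?!AM))*', 'match' ) ;

disp( blocks{1} )   % the 'R' of 'Rx' is kept, the second 'RAM' is not
```

If the trailing 'MSN_' matters for later processing, it can be stripped from each block afterwards; for extracting time tags and AD IDs it is harmless.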

The second step is to extract AD IDs and time tags from each block.

 data = cell( size( BER_blocks )) ;
 for bId = 1 : numel( BER_blocks )
    tokens = regexp( BER_blocks{bId}, 'Time Tag:\s*([\d\.]+).+?AD ID:\s*(\d+)', ...
                     'tokens' ) ;
    data{bId} = reshape( str2double( [tokens{:}] ), 2, [] ).' ;
 end

Based on the above content, this leads to the following data cell array (each cell contains the time tags and AD IDs of one MSN_BER/MSN_RAM block):

 >> celldisp( data )
 data{1} =
        58573          21
        58574          21
 data{2} =
        58578          41
        58579          41
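The token-to-matrix step can be checked on its own. The following is a small sketch, assuming a made-up two-record snippet (blk, flat and pairs are hypothetical names) laid out like the log above: [tokens{:}] flattens the nested cells into one row of strings, and the reshape/transpose turns that row into one [time tag, AD ID] pair per line.

```matlab
% Hypothetical two-record snippet in the same layout as the log above.
blk = sprintf( [ 'Time Tag: 58573.00000000\n   Band ID: 1   AD ID:   21 Scan ID: 0\n', ...
                 'Time Tag: 58574.00000000\n   Band ID: 1   AD ID:   22 Scan ID: 0\n' ] ) ;

% Each match yields a 1x2 cell: {time tag, AD ID}, both as strings.
tokens = regexp( blk, 'Time Tag:\s*([\d\.]+).+?AD ID:\s*(\d+)', 'tokens' ) ;

% Flatten to a 1x4 cell of strings, convert to numbers, then fold the
% row into a 2-column matrix with one record per row.
flat  = [tokens{:}] ;
pairs = reshape( str2double( flat ), 2, [] ).' ;

disp( pairs )
```

This relies on MATLAB's default behavior where '.' also matches newline characters, which is why '.+?' can span from the Time Tag line to the AD ID line.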

You can then concatenate these cells' content if you want to have one big array instead of one array per block:

 >> data = vertcat( data{:} )
 data =
       58573          21
       58574          21
       58578          41
       58579          41

Let me know if it's not what you wanted.

Related Solutions

MATLAB: Can REGEXP or TEXTSCAN be used to split 2 distinct data sets from a single text file

Try this:

    str = fileread('your_file.txt');
    ca1 = regexp( str, 'MSN_JET.+?(?=(MSN_SENSUM)|($))', 'match' );
    ca2 = regexp( str, 'MSN_SENSUM.+?(?=(MSN_JET)|($))', 'match' );

It then remains to write the two sets of blocks to their files. This process does not remove any newline characters.
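As a quick illustration of how the lookahead bounds each match, here is a sketch on a made-up one-line string (the sample text is an assumption, and the capture groups inside the lookahead are dropped because they do not affect the 'match' output).

```matlab
% Hypothetical sample with two interleaved block types.
str = 'MSN_JET 1 MSN_SENSUM 2 MSN_JET 3 MSN_SENSUM 4' ;

% Each match runs lazily from its marker up to (but not including) the
% other marker, or to the end of the string.
ca1 = regexp( str, 'MSN_JET.+?(?=MSN_SENSUM|$)', 'match' ) ;
ca2 = regexp( str, 'MSN_SENSUM.+?(?=MSN_JET|$)', 'match' ) ;

% Concatenating the pieces gives the text destined for each output file.
jetText    = [ca1{:}] ;
sensumText = [ca2{:}] ;
disp( jetText )
disp( sensumText )
```

Because the lookahead consumes nothing, each marker stays available as the start of the next match for the other pattern.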

MATLAB: What would be the best approach to solve this data mapping problem

It is not trivial, in the sense that REGEXP provides you with two series of data and no information for relating one to the other. I see two options (without thinking too much):

1. Instead of calling REGEXP twice on the whole buffer, call it once first to split the buffer into blocks at each 'MSN_BER'. You can then loop over these blocks and extract the data that are to be mapped. E.g. (not tested):

EDIT: splitting using REGEXP is simpler than my first proposal.

 bufferSplit = regexp(buffer, 'MSN_BER', 'split') ;
 for bId = 1 : length(bufferSplit)
    if isempty(bufferSplit{bId}),  continue ;  end
    % Here, your code based on two REGEXP using bufferSplit{bId} 
    % instead of buffer.
 end

This way you know that, at each step of the loop, BER_State_Data and AC12_Data belong to the same block.

First proposal (I leave it for the record):

 startPos = regexp(buffer, 'MSN_BER', 'start') ;
 nBlocks = length(startPos) ;
 for bId = 1 : nBlocks
    if bId < nBlocks
        miniBuffer = buffer(startPos(bId):startPos(bId+1)-1) ;
    else
        miniBuffer = buffer(startPos(bId):end) ;
    end
    % Here, your code based on two REGEXP using miniBuffer instead of buffer.
 end

2. If you can count on the fact (?) that the 'Time Tag:' fields of the entries that belong to the same block as a given BER entry are <= the 'Rx'd at:' (or State Time) field of that BER entry, then you can build the join directly from what you already have, using the 2nd column of BER_State_Data and the first column of AC12_Data. E.g. (not tested):

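 for berId = 1 : size(BER_State_Data, 1)
    % Lower bound: the previous BER entry's time (0 for the first entry).
    if berId == 1,  prev = 0 ;  else prev = BER_State_Data(berId-1,2) ;  end
    % Logical index of the AC12 rows whose time tag falls in (prev, current].
    ac12Ids = AC12_Data(:,1)>prev & AC12_Data(:,1)<=BER_State_Data(berId,2) ;
    % Here you build whatever you want with
    %   BER_State_Data(berId,:)   and    AC12_Data(ac12Ids,:)
 end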
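For option 1, a minimal sketch of the split-then-extract loop on a made-up buffer may help (the buffer text and the names parts and nPairs are illustrative assumptions; the inner pattern is the Time Tag/AD ID pattern used elsewhere on this page). It only counts the pairs found per block, which is enough to show that each pair stays tied to its own MSN_BER block.

```matlab
% Hypothetical buffer: two MSN_BER blocks holding one and two records.
buffer = [ 'MSN_BER s1 Time Tag: 10 AD ID: 1 ', ...
           'MSN_BER s2 Time Tag: 20 AD ID: 2 Time Tag: 30 AD ID: 3' ] ;

% Splitting at the marker leaves an empty first piece when the buffer
% starts with 'MSN_BER', hence the isempty check in the loop.
parts  = regexp( buffer, 'MSN_BER', 'split' ) ;
nPairs = [] ;
for bId = 1 : numel( parts )
    if isempty( parts{bId} ),  continue ;  end
    tok = regexp( parts{bId}, 'Time Tag:\s*([\d\.]+).+?AD ID:\s*(\d+)', 'tokens' ) ;
    nPairs(end+1) = numel( tok ) ;   % number of (time tag, AD ID) pairs in this block
end

disp( nPairs )
```

In real code, the body of the loop would store the tokens themselves rather than just their count.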

=========================================================

PS: if you are the Brad who asked earlier about calling various functions based on a "per column" function ID, here is one example:

 f{1} = @sin ;
 f{2} = @(x) x.^(1/2) ;
 f{3} = @(x) -x ;
 M = magic(8)
 c = [1, 1, 1, 2, 2, 3, 3, 3] ;
 fM = arrayfun(@(cId) f{c(cId)}(M(:,cId)), 1:length(c), 'UniformOutput', false);
 cell2mat(fM)
