MATLAB: How to parse this complex text file with textscan


I have a text file that is in a rather funky format. The file comes out of a relational database (Antelope) and consists of earthquake location, dates, times, phase information, etc. I need to parse out and collect the 'data blocks' that are in between each header line. I need the header lines as well for each "block". I have edited the file to include an EOB (end of block) marker to make this task easier, but it's not as trivial as I thought. Here's an image of the first 68 or so lines (out of about 1 million).

I'd like to pull the 4 columns below each header. For example, the first section is:

 2015  1 22  0  8 58.537   45.97929 -129.98717  1.184  0.0  1.039  3.621  0.036      1
 AXCC1  0.843 1.00 P
 AXAS2  1.263 1.00 P
 AXEC1  0.923 1.00 P
 AXEC2  1.103 1.00 P
 AXEC3  1.088 1.00 P
 AXCC1  1.873 0.25 S
 AXAS1  2.728 0.06 S
 AXAS2  2.168 0.25 S
 AXEC1  1.708 0.33 S
 AXEC2  2.043 0.25 S
 AXEC3  2.113 0.25 S

and put those in an array. But I need to be able to associate the header line, specifically the last integer in the header line (1 in this case), with each code block.

So far my code looks like this, but obviously it is not working yet. I don't get any errors but it's missing and skipping data etc.

fid=fopen('ph2dt_catalog8_edit.dat');
Block=1;
while (~feof(fid))
      InputText=textscan(fid,'%s',1,'delimiter','\n');
      HeaderLines{Block,1}=InputText{1};
      disp(HeaderLines{Block});
      FormatString='%s%f%f%s'; 
      InputText=textscan(fid, FormatString, 'delimiter','WhiteSpace','CollectOutput',1);
      Data{Block,1} = cell2mat(InputText{2});    
      [NumRows,NumCols] = size(Data{Block}); 
      eob=textscan(fid,'%s',1,'delimiter','\n');
      Block=Block +1;
end

Can anyone offer any suggestions? Let me know if I need to clarify anything further.

Best Answer

Assumptions

Speed is important - "any ideas on faster method?"
The text file fits in memory - "The entire file is about 23 MB."
The station names are exactly five characters - "5" appears in the code as a magic number
The value of PHA is exactly one character
The line separator is "\n", i.e. char(10)
The header lines begin with 2014,2015,2016 or 2017 (and are the only lines to begin so).

Approach

Read the entire file into a character string.
Split the string into a cell array of strings, with one block in each cell
Pre-allocate output variables based on the size of the string and the cell array
Loop over all blocks and parse one block at a time
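
The steps above can be sketched on a small in-memory sample instead of a file. This is only an illustration under stated assumptions: the second header is made up by copying the first one and changing the trailing event id to 2, and the names ids and nrow are hypothetical. The look-ahead is anchored to a newline here, and the data rows are read with '%s%f%f%s' for simplicity rather than with a fixed-width character format.

```matlab
% Build a tiny two-block sample; char(10) is the newline character.
nl   = char(10);
hdr1 = ' 2015  1 22  0  8 58.537   45.97929 -129.98717  1.184  0.0  1.039  3.621  0.036      1';
hdr2 = [ hdr1(1:end-1), '2' ];   % made-up second header: same line, event id 2
str  = [ hdr1, nl, ' AXCC1  0.843 1.00 P', nl, ' AXAS2  1.263 1.00 P', nl, ...
         hdr2, nl, ' AXEC1  0.923 1.00 P', nl ];

% Split into blocks: each match starts at a header line and runs
% up to (not including) the newline before the next header line.
xpr    = '(?<=(^|\n))[ ]*201[4567].+?(?=($|\n[ ]*201[4567]))';
blocks = regexp( str, xpr, 'match' );

% Pre-allocate, then parse one block at a time.
ids  = nan( numel(blocks), 1 );   % last integer of each header line
nrow = nan( numel(blocks), 1 );   % number of data rows in each block
for k = 1:numel( blocks )
    parts   = regexp( blocks{k}, '\n', 'split', 'once' );  % {header, data rows}
    hdr     = sscanf( parts{1}, '%f' );                    % numeric header fields
    ids(k)  = hdr(end);
    rows    = textscan( parts{2}, '%s%f%f%s' );            % station, time, weight, phase
    nrow(k) = numel( rows{2} );
end
```

Each block keeps its header and its data rows together, so the event id can be attached to every row of that block.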

I tested with community_edit_2.txt, which is community_edit.txt with the # removed.

STA and PHA are character arrays rather than cell arrays of strings, which is somewhat faster.

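function [ ORG, ARV, STA, PHA, EVD ] = cssm( filespec )
    % Read the entire file into one character string.
    str = fileread( filespec );
    % One match per block: a header line that starts with 2014..2017 and
    % everything up to the next such header. The look-ahead is anchored to a
    % newline so that "2015" inside a data value cannot end a block early.
    xpr = '(?<=(^|\n))[ ]*201[4567].+?(?=($|\n[ ]*201[4567]))';
    blocks = regexp( str, xpr, 'match' );
    nnl = length( strfind( str, char(10) ) );   % number of newlines in the file
    len = length( blocks );                     % number of blocks
    ORG = nan(len,14);                          % one row per header line
    %
    % Upper bound on the number of data rows: all lines minus the header lines.
    N   = nnl - len + 1;
    STA = repmat( '-', [N,5] );                 % station names, 5 characters each
    ARV = nan(N,1);                             % second column of the data rows
    PHA = repmat( '-', [N,1] );                 % phase, 1 character
    EVD = nan(N,1);                             % last integer of the owning header
    nextORG = 1;
    nextSTA = 1;
    for cac = blocks
        % Split the block into its header line and the remaining data rows.
        S0  = regexp( cac{1}, '\n', 'split', 'once' );
        S1  = textscan( S0{1}, '%f%f%f%f%f%f%f%f%f%f%f%f%f%f' ...
                    ,   'CollectOutput',true                  );
        ORG( nextORG, : ) = S1{1};
        % Origin time as a serial date number; computed but not returned.
        MATDAY_ARV = datenum( S1{1}(1:6) ); %#ok<NASGU> 
        nextORG = nextORG + 1;
        %
        S2  = textscan( S0{2}, '%5c%f%f%1c' );
        N2  = size( S2{1}, 1 );
        STA( nextSTA:nextSTA+N2-1, : ) = S2{1}; 
        ARV( nextSTA:nextSTA+N2-1, 1 ) = S2{2}; 
        PHA( nextSTA:nextSTA+N2-1, 1 ) = S2{4}; 
        EVD( nextSTA:nextSTA+N2-1, : ) = S1{1}(end); 
        nextSTA = nextSTA + N2;
    end
    %
    if  N >= nextSTA % drop the pre-allocated rows that were never filled
        STA( nextSTA : end, : ) = [];
        ARV( nextSTA : end ) = [];
        PHA( nextSTA : end ) = [];
        EVD( nextSTA : end ) = [];
    end
end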
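
The outputs STA, ARV, PHA and EVD are parallel arrays with one row per data line, and EVD carries the last integer of the header that owns each row. As a minimal sketch of how they might be used, the example below builds small stand-ins for the outputs from four rows of the sample block in the question (it does not call cssm) and selects rows with logical masks. The mask names are hypothetical.

```matlab
% Stand-ins for the outputs of cssm, taken from the sample block (event id 1).
STA = [ 'AXCC1'; 'AXAS2'; 'AXCC1'; 'AXAS2' ];   % 4x5 character array
ARV = [ 0.843; 1.263; 1.873; 2.168 ];
PHA = [ 'P'; 'P'; 'S'; 'S' ];                   % 4x1 character array
EVD = [ 1; 1; 1; 1 ];

isEvent = ( EVD == 1 );                         % rows that belong to event 1
isS     = ( PHA == 'S' );                       % rows with phase S
isCC1   = strcmp( cellstr( STA ), 'AXCC1' );    % compare whole station names

% Value for the S phase at station AXCC1 in event 1.
arvS_CC1 = ARV( isEvent & isS & isCC1 );
```

Converting STA with cellstr only at the point of comparison keeps the parsing loop fast while still allowing whole-name matching.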

Error handling: This function has no error handling beyond MATLAB's own; for example, fileread reports a missing text file. If the function is intended for routine use, it is important to handle errors, especially those caused by unexpected character strings in the input file.
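
One possible check, shown only as a sketch, is to verify that a header line really holds 14 numbers before it is stored. The helper checkHeader, the error identifier cssm:badHeader and the malformed sample line are all made up for this illustration.

```matlab
goodText = ' 2015  1 22  0  8 58.537   45.97929 -129.98717  1.184  0.0  1.039  3.621  0.036      1';
badText  = ' 2015  1 22  0  8 58.537   45.97929 oops';   % made-up malformed header

% sscanf stops at the first field that is not a number,
% so a short result signals an unexpected string.
checkHeader = @(txt) numel( sscanf( txt, '%f' ) ) == 14;

caught = '';
try
    if ~checkHeader( badText )
        error( 'cssm:badHeader', 'Header does not hold 14 numbers: "%s"', badText );
    end
catch me
    caught = me.identifier;   % a caller could log the problem and skip the block
end
```

Whether a bad block should stop the run or be skipped depends on how the function is used.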

2016-11-18, Performance test

Computer: eight year old vanilla desktop with 8GB RAM.
System: Windows 7, 64-bit; MATLAB R2016a, 64-bit
Test file: community_edit_1M.txt is 27.6MB, 95200 blocks, 1097181 lines. It's created by concatenating copies of community_edit.txt and removing the #.

>> filespec = 'h:\m\cssm\community_edit_1M.txt';
>> tic,[ORG0,ARV0,STA0,PHA0,EVD0] = cssm( filespec ); toc
Elapsed time is 22.443859 seconds.

Caveat: The text file was probably available in the system cache, since this was not cleared before the test.

Comparison: This is about 4.5 times faster than the function asd:

>> filespec = 'h:\m\cssm\community_edit_1M_EOB.txt';
>> tic, [Data, HeaderLines] = asd( filespec ); toc
Elapsed time is 101.202009 seconds.
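
Single tic/toc runs like the ones above are sensitive to caching and background load. As a hedged alternative, timeit calls a function handle several times and returns a typical run time. The sketch below times a stand-in operation on an in-memory string, since the test files are not available here; to time the real function one would wrap the call as @() cssm( filespec ).

```matlab
% Stand-in workload: split a repeated sample row into lines.
str = repmat( [ ' AXCC1  0.843 1.00 P', char(10) ], 1, 20000 );
f   = @() regexp( str, '\n', 'split' );

t = timeit( f );                          % typical run time in seconds
fprintf( 'Typical time: %.4f s\n', t );
```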

Related Solutions

MATLAB: Issue with data format when using textscan()

For a duration, the format must be all lower case:

%{hh:mm:ss}T

However, the data appears to be semicolon-delimited, so you will probably have more luck with readtable:

opts = detectImportOptions('D:\SI010118.txt')
opts = setvartype(opts,1,'datetime')
opts = setvaropts(opts,1,'InputFormat','dd.MM.uuuu HH:mm:ss')
readtable('D:\SI010118.txt',opts)

MATLAB: Question about fgetl(fileID)

Try

    function    ccsm()
        fid = fopen( 'cssm.txt', 'r' );
        % Each line has the form "name = value"; keep the part after "=".
        cac = regexp( fgetl( fid ), '=', 'split' );
        patientName = strtrim( cac{2} );
        cac = regexp( fgetl( fid ), '=', 'split' );
        dateofBirth = datenum( strtrim( cac{2} ), 'mm/dd/yyyy' );
        cac = regexp( fgetl( fid ), '=', 'split' );
        healthy_exposed = str2double(cac{2});
        cac = regexp( fgetl( fid ), '=', 'split' );
        pus = str2double(cac{2});
        cac = regexp( fgetl( fid ), '=', 'split' );
        necrotic = str2double(cac{2});
        cac = regexp( fgetl( fid ), '=', 'split' );
        ulcer_stage = str2double(cac{2});
        cac = regexp( fgetl( fid ), '=', 'split' );
        area = str2double(cac{2});
        cac = regexp( fgetl( fid ), '=', 'split' );
        volume = str2double(cac{2});
        % Close only the file opened here rather than every open file.
        fclose( fid );
    end

where cssm.txt contains

    patientName = John Doe
    dateofBirth = 04/25/1987
    healthy_exposed =75
    pus = 10
    necrotic = 15
    ulcer_stage = 3
    area = 89
    volume = 90
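
When the number of lines is not fixed, the same fgetl pattern can be put in a loop that stops when fgetl returns -1 at the end of the file. The sketch below is one way to do that, assuming every value is numeric. It writes a small temporary file with three of the lines shown above so that it can run on its own, and the struct name rec is hypothetical.

```matlab
% Write a small temporary file so the example is self-contained.
fname = [ tempname, '.txt' ];
fid   = fopen( fname, 'w' );
fprintf( fid, 'pus = 10\nnecrotic = 15\nulcer_stage = 3\n' );
fclose( fid );

% Read "name = value" lines until fgetl signals the end of the file.
rec = struct();
fid = fopen( fname, 'r' );
txt = fgetl( fid );
while ischar( txt )                       % fgetl returns -1 (not char) at EOF
    cac = regexp( txt, '=', 'split', 'once' );
    if numel( cac ) == 2                  % ignore lines without "="
        rec.( strtrim( cac{1} ) ) = str2double( cac{2} );
    end
    txt = fgetl( fid );
end
fclose( fid );
delete( fname );
```

Dynamic field names keep the variable names out of the code, at the cost of requiring that each name in the file is a valid MATLAB identifier.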
