MATLAB: Fastest way to add string

I'm dealing with very large CSV files, and I have little to no problem with the speed of reading them with readtable. However, I have found (and reported) a bug in readtable where a blank value in the first column (the line starts with the delimiter, e.g. ',') throws off all the data. A lot of my files have blank values in the first column because of the way my equipment records the data.

So, I have to "preprocess" the files and look for these blank columns in the csv file. The most efficient method I've found is the following:

fprintf('Reading File...');
ch = fread(YGID, [1,chunksize], 'int8=>char');
%cch = char(ch');
fprintf('Getting Number Of Lines...');
nol = sum(ch == sprintf('\n')); % number of lines
fprintf('%i\n',nol);
fprintf('Replacing final commas...\n');
cch = regexprep(ch,',(\r|\n)+','$1');
clear ch;
fprintf('Getting line locations...\n');
hlocs = regexp(cch,'\n');
fprintf('Writing Header File...\n');
fwrite(HDID,cch(hlocs(2)+1:hlocs(10)));
fprintf('Replacing Initial Commas\n');
ccch = regexprep(cch,'(\r|\n)+,','$1 ,');

YGID is the file pointer from an fopen. Note that I'm purposely making new variables (not memory efficient) as I have 16 GB of RAM available on my machine and I find making a completely new variable is faster. However, once the file is of a sufficient size (>20 MB, I have some over 200MB), even this becomes very slow. The line it is getting stuck on is "ccch = regexprep(cch,'(\r|\n)+,','$1 ,');" I suspect it's because with each additional space being added (there are hundreds of thousands) it's reallocating memory for the variable. I've tried to "preallocate" the new variable with "ccch = blanks(chunksize + nol);" before it and it didn't seem to make a difference.

Is there any more efficient way to do this task?

Best Answer

Found my own answer. strrep is surprisingly faster than regexprep. I had to add a conditional to check the OS, though:

if ispc || isunix
    fpatt = sprintf('\n,');
    rpatt = sprintf('\n, ');
else
    fpatt = sprintf('\r,');
    rpatt = sprintf('\r, ');
end
ccch = strrep(cch,fpatt,rpatt);
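
As a quick sanity check on a tiny sample, the sketch below compares the regexprep call from the question with a strrep replacement. Note that the question inserts the space before the leading comma ('$1 ,'), whereas the snippet above inserts it after the comma; the variant here follows the question. The sample string and the names viaRegexp and viaStrrep are made up for illustration, and the comparison assumes LF line ends with no blank lines in front of a leading comma.

```matlab
% Hypothetical sample: the 2nd and 3rd lines start with the delimiter.
cch = sprintf('a,b,c\n,1,2\n,3,4\n');

% Approach from the question: regular expression with a captured line break.
viaRegexp = regexprep(cch, '(\r|\n)+,', '$1 ,');

% strrep variant that also puts the space before the leading comma.
viaStrrep = strrep(cch, sprintf('\n,'), sprintf('\n ,'));

% Displays 1 (true) when both approaches produce the same text.
disp(isequal(viaRegexp, viaStrrep))
```

On real files it is worth running the same comparison on a small chunk before switching, since runs of several line breaks are handled differently by the two patterns.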

Related Solutions

MATLAB: Parsing Formatted Text File Quickly

[Comment on textscan deleted.]

Below are three functions that read your file. You might want to modify them so that text_file_name and number_of_data_columns are input arguments.

The functions return a structure array with one element per data block in the file.
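
For readers unfamiliar with the idiom, the functions preallocate that structure array by passing cell arrays as field values to struct. A minimal sketch, where the block count of 3 is an arbitrary assumption:

```matlab
n_block = 3;   % hypothetical number of data blocks
S = struct( 'RowHeader', cell( 1, n_block ) ...
          , 'Data'     , cell( 1, n_block ) ...
          , 'Time'     , cell( 1, n_block ) );

% Each element starts with empty fields and is filled inside the loop.
S(2).Time = '2012-09-06';
disp( size( S ) )
```

Passing a 1-by-n cell array as a field value makes struct return a 1-by-n structure array rather than a scalar structure holding a cell.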

I use a three-year-old vanilla Dell with R2012a, 64-bit, Win7.

--- Reads the whole file to a string buffer and parses it in a second step ---

My test below returns a 0.9GB structure in 23 seconds.

    >> tic, S  = read_huge_CRLF_1(); toc
    Elapsed time is 22.613608 seconds.
    >> S
    S = 
    1x6060 struct array with fields:
        RowHeader
        Data
        Time
    >> S(1)
    ans = 
        RowHeader: {1042x1 cell}
             Data: [1042x3 double]
             Time: '2012-09-06'    
    >> whs = whos('S');
    >> whs.bytes/1e9
    ans =
        0.9114

where read_huge_CRLF_1 is

    function    S = read_huge_CRLF_1()
        str_buf = fileread( 'c:\MyData\Test\huge_CRLF.txt' );
        ix_list = strfind( str_buf, 'Timestamp:' );
        n_block = numel( ix_list );
        n_col   = 4; 
        frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
        S   = struct( 'RowHeader'   , cell( 1, n_block )    ...
                    , 'Data'        , cell( 1, n_block )    ...
                    , 'Time'        , cell( 1, n_block )    ...
                    );
        for ii = 1 : n_block
            ix1 = ix_list(ii);
            if ii == n_block
                buf = str_buf( ix1 : end );
            else
                buf = str_buf( ix1 : ix_list(ii+1)-1 );
            end
            S(ii).Time  = sscanf( buf, 'Timestamp:%s', 1 );
            cac = textscan( buf             , frmt  ...
                        ,   'CollectOutput' , true  ...
                        ,   'HeaderLines'   , 1     ...
                        );
            S(ii).RowHeader = cac{1}; 
            S(ii).Data      = cac{2};
        end
    end

and where C:\MyData\Test\huge_CRLF.txt is 0.24GB and contains rows like

    Header
    Timestamp: 2012-09-06 01:15 
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    Timestamp: 2012-09-06 01:15 
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    Timestamp: 2012-09-06 01:15 
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248

Comments

It is possible to increase the speed somewhat.
The cost of "'HeaderLines', 1" is surprisingly high. That might say more about me than about textscan :).
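
One way to avoid 'HeaderLines' when parsing from a string buffer is to cut the first line off the block before calling textscan. This is only a sketch under assumptions: the block text and the names ix_eol and body are made up here, and whether it is actually faster would have to be timed on the real file.

```matlab
% Hypothetical block in the same layout as the sample file (CRLF line ends).
buf = sprintf('Timestamp: 2012-09-06 01:15\r\nrow1 3.1416 6.2832 9.4248\r\nrow2 1 2 3\r\n');

% Find the end of the first line and drop it, instead of 'HeaderLines', 1.
ix_eol = find( buf == sprintf('\n'), 1, 'first' );
body   = buf( ix_eol+1 : end );

% Same format as in read_huge_CRLF_1: one string column, three numeric ones.
cac = textscan( body, '%s%f%f%f', 'CollectOutput', true );
RowHeader = cac{1};   % row labels
Data      = cac{2};   % numeric columns collected into one matrix
```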

--- Afterthought ---

The state of the file cache was not well defined when the test above was performed. Thus, I ran the test three times in a row directly after restart of the computer. I assume the text file is not in the cache. There is no real difference.

    % Restart computer
    >> tic, S  = read_huge_CRLF_1(); toc     % Free dropped to zero
    Elapsed time is 23.249289 seconds.       
    >> tic, S  = read_huge_CRLF_1(); toc     % 0.91GB S in base workspace 
    Elapsed time is 24.461189 seconds.
    >> tic, S  = read_huge_CRLF_1(); toc
    Elapsed time is 23.933955 seconds.

--- Reading blocks from cache is 20% faster. Doesn't read block_header ---

In this test textscan reads the blocks of data from the Windows file cache. In the previous test textscan read from a string buffer in the function workspace. The text file, huge_CRLF_4.txt, which is a copy of huge_CRLF.txt, was not in the cache before this test.

    >> clear('S'), tic, S  = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
    Elapsed time is 18.926222 seconds.
    >> clear('S'), tic, S  = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
    Elapsed time is 17.150977 seconds.
    >> clear('S'), tic, S  = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
    Elapsed time is 17.077009 seconds.
    >>

where read_huge_CRLF_3 is

    function    S = read_huge_CRLF_3( file_spec )
        if nargin == 0
            file_spec  = 'c:\MyData\Test\huge_CRLF_Sample.txt';
        end
        n_block_header_row  = 1;
        str_buf             = fileread( file_spec );
        ix_char_timestamp   = strfind( str_buf, 'Timestamp:' );
        ix_char_start_line  = [ 1, strfind( str_buf, char([13,10]) ) + 2 ];
        is_char_start_block = ismember( ix_char_start_line ...
                                      , ix_char_timestamp  ); 
        ii_line_start_block = ( 1 : 1 : size( ix_char_start_line, 2 ) );
        ii_line_start_block( not( is_char_start_block ) ) = [];
        n_block = numel( ii_line_start_block ); 
        n_col   = 4; 
        frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
        S   = struct( 'BlockHeader' , cell( 1, n_block )    ...
                    , 'Data'        , cell( 1, n_block )    ...
                    );
        fid = fopen( file_spec, 'r' );
        if ii_line_start_block(1) >= 2
            cac = textscan( fid, '%[^\n\r]', ii_line_start_block(1)-1 ); 
        end
        iiBlock = 0;
        while not( feof( fid ) )
            iiBlock     = iiBlock + 1;
            if iiBlock == n_block 
                n_data_row = inf;
            else
                n_data_row  = ii_line_start_block( iiBlock+1 ) ...
                            - ii_line_start_block( iiBlock )   ...
                            - n_block_header_row               ; 
            end
            cac = textscan( fid                                     ...
                        ,   frmt            , n_data_row            ...
                        ,   'CollectOutput' , true                  ...
                        ,   'HeaderLines'   , n_block_header_row    ...
                        );
            S(iiBlock).BlockHeader  = cac{1}; 
            S(iiBlock).Data         = cac{2};
        end
        fclose( fid );
    end

Comments

The function works as far as I can tell. However, this construct is erroneous

    fid = fopen( file_spec, 'r' );
    ...
    cac = textscan( fid, '%[^\n\r]', ii_line_start_block(1)-1 );

After it is executed the "file position indicator" is to the left of the EOL characters. See my question fgetl, textscan, and the file position indicator.

Adding "t" to the permission string, i.e.

    fid = fopen( file_spec, 'rt' );

does not solve the problem in my case. EOL is CRLF and the pointer will be positioned between the CR and LF. One solution is to add "'Delimiter', '\n'" to the argument list of textscan.

--- My final function to read the file ---

Further refactored. The text file, huge_CRLF_5.txt, which is a copy of huge_CRLF.txt, was not in the cache before this test.

    clear('S'), tic, S  = read_huge_CRLF_5('c:\...\huge_CRLF_5.txt'); toc
    Elapsed time is 19.284555 seconds.
    clear('S'), tic, S  = read_huge_CRLF_5('c:\...\huge_CRLF_5.txt'); toc
    Elapsed time is 17.210736 seconds.

where read_huge_CRLF_5 is

    function    S = read_huge_CRLF_5( file_spec )
        if nargin == 0
            file_spec  = 'c:\MyData\Test\huge_CRLF_Sample.txt';
        end
        n_block_header_row  = 1;
        str_buf             = fileread( file_spec );
        ix_char_timestamp   = strfind( str_buf, 'Timestamp:' );
        ix_char_start_line  = [ 1, strfind( str_buf, char([13,10]) ) + 2 ];
        is_char_start_block = ismember( ix_char_start_line ...
                                      , ix_char_timestamp  ); 
        clear('str_buf')
        ii_line_start_block = ( 1 : 1 : size( ix_char_start_line, 2 ) );
        ii_line_start_block( not( is_char_start_block ) ) = [];
        n_block = numel( ii_line_start_block ); 
        n_col   = 4; 
        frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
        S   = struct( 'BlockHeader' , cell( 1, n_block )    ...
                    , 'Data'        , cell( 1, n_block )    ...
                    , 'Time'        , cell( 1, n_block )    ...
                    );
        fid = fopen( file_spec, 'r' );
        cup = onCleanup( @() fclose(fid) );
        if ii_line_start_block(1) >= 2
            textscan( fid, '%s', ii_line_start_block(1)-1 ...
                        ,   'Delimiter', '\n'             ); 
        end
        iiBlock = 0;
        while not( feof( fid ) )
            iiBlock     = iiBlock + 1;
            if iiBlock == n_block 
                n_data_row = inf;
            else
                n_data_row  = ii_line_start_block( iiBlock+1 ) ...
                            - ii_line_start_block( iiBlock )   ...
                            - n_block_header_row               ; 
            end
            S(iiBlock).Time  = sscanf( fgetl(fid), 'Timestamp:%s' );
            cac = textscan( fid                             ...
                        ,   frmt            , n_data_row    ...
                        ,   'CollectOutput' , true          ...
                        );
            S(iiBlock).BlockHeader  = cac{1}; 
            S(iiBlock).Data         = cac{2};
        end
    end

Comments

clear('str_buf') frees memory. In these tests it is clearly faster (about 20%) to read from the file cache than to use the string buffer.
"'Delimiter', '\n'" places the "file position indicator" to the right of the EOL characters, whether they are CRLF or LF and whether or not "t" is added to the "permission string".
fgetl(fid) always places the "file position indicator" to the right of the EOL characters.
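
The fgetl behaviour can be checked on a small throwaway file. The sketch below is an illustration only; the file content and the names tmp, first_line, pos and next_line are assumptions, not part of the functions above.

```matlab
% Write a small CRLF file to a temporary location (hypothetical content).
tmp = [ tempname, '.txt' ];
fid = fopen( tmp, 'w' );
fprintf( fid, 'Header\r\nTimestamp: 2012-09-06 01:15\r\n' );
fclose( fid );

fid        = fopen( tmp, 'r' );
first_line = fgetl( fid );   % first line, returned without its terminator
pos        = ftell( fid );   % byte offset of the next unread character
next_line  = fgetl( fid );   % begins with 'Timestamp:' if the EOL was consumed
fclose( fid );
delete( tmp );

disp( pos )
disp( next_line )
```

If the position indicator had been left between CR and LF, the second fgetl would return an empty line instead of the timestamp line.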

MATLAB: How to ignore or delete the last row of a text file when importing

It’s difficult to be specific without your file. If you know how many lines you want to read, you can set that limit as a parameter.

From the documentation:

C = textscan(fileID,formatSpec,N) reads file data using the formatSpec N times, where N is a positive integer. To read additional data from the file after N cycles, call textscan again using the original fileID. If you resume a text scan of a file by calling textscan with the same file identifier (fileID), then textscan automatically resumes reading at the point where it terminated the last read.
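
A minimal sketch of that idea, assuming a made-up file with one record per line, two numeric columns, a trailing newline, and a single footer row that should be ignored (the file content and the names tmp, n_line and data are hypothetical):

```matlab
% Hypothetical file: three numeric rows followed by a footer row to skip.
tmp = [ tempname, '.txt' ];
fid = fopen( tmp, 'w' );
fprintf( fid, '1 2\n3 4\n5 6\nEND OF FILE\n' );
fclose( fid );

% Count the lines first, then ask textscan for one cycle fewer.
n_line = numel( strfind( fileread( tmp ), sprintf('\n') ) );

fid = fopen( tmp, 'r' );
C   = textscan( fid, '%f %f', n_line - 1, 'CollectOutput', true );
fclose( fid );
delete( tmp );

data = C{1};   % numeric rows only; the footer row is never parsed
disp( data )
```

Counting newlines costs one extra pass over the file, so for very large files it may be preferable to obtain the row count some other way if it is known in advance.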
