[Comment on textscan deleted.]
Below are three functions that read your file. You might want to modify them so that text_file_name and number_of_data_columns are input arguments.
The functions return a structure array with one element per data block in the file.
I use a three-year-old vanilla Dell with R2012a, 64-bit, Windows 7.
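As a minimal sketch of that suggestion, the format string can be built from the number of columns instead of being hard-coded. It assumes that number_of_data_columns counts only the numeric columns (the functions in this post use n_col = 4, which includes the row-header column); the argument names are the ones suggested above and do not appear in the functions themselves.

```matlab
% Hypothetical signature:
%   function S = read_huge_CRLF_1( text_file_name, number_of_data_columns )
% The two assignments below stand in for the input arguments.
text_file_name         = 'c:\MyData\Test\huge_CRLF.txt';   % path used in the post
number_of_data_columns = 3;   % assumption: numeric columns only, row header excluded

% One '%s' for the row header, followed by one '%f' per numeric column.
frmt = [ '%s', repmat( '%f', 1, number_of_data_columns ) ];

fprintf( 'file  : %s\n', text_file_name );
fprintf( 'format: %s\n', frmt );
```

Inside the functions, text_file_name would replace the hard-coded path in the call to fileread, and frmt would replace the cat(...) expression.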
.
--- Reads the whole file to a string buffer and parses it in a second step ---
My test below returns a 0.9GB structure in 23 seconds.
>> tic, S = read_huge_CRLF_1(); toc
Elapsed time is 22.613608 seconds.
>> S
S =
1x6060 struct array with fields:
RowHeader
Data
Time
>> S(1)
ans =
RowHeader: {1042x1 cell}
Data: [1042x3 double]
Time: '2012-09-06'
>> whs = whos('S');
>> whs.bytes/1e9
ans =
0.9114
where read_huge_CRLF_1 is
function S = read_huge_CRLF_1()
    % Read the whole file into a string buffer and find where each block starts.
    str_buf = fileread( 'c:\MyData\Test\huge_CRLF.txt' );
    ix_list = strfind( str_buf, 'Timestamp:' );
    n_block = numel( ix_list );
    n_col   = 4;    % one row-header column plus three numeric columns
    frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
    % Preallocate a 1-by-n_block structure array.
    S = struct( 'RowHeader' , cell( 1, n_block ) ...
              , 'Data'      , cell( 1, n_block ) ...
              , 'Time'      , cell( 1, n_block ) ...
              );
    for ii = 1 : n_block
        % Cut out the text of block ii.
        ix1 = ix_list(ii);
        if ii == n_block
            buf = str_buf( ix1 : end );
        else
            buf = str_buf( ix1 : ix_list(ii+1)-1 );
        end
        S(ii).Time = sscanf( buf, 'Timestamp:%s', 1 );
        % Skip the 'Timestamp:' line and parse the data rows.
        cac = textscan( buf , frmt          ...
                      , 'CollectOutput' , true ...
                      , 'HeaderLines'   , 1    ...
                      );
        S(ii).RowHeader = cac{1};
        S(ii).Data      = cac{2};
    end
end
and where C:\MyData\Test\huge_CRLF.txt is 0.24GB and contains rows like
Header
Timestamp: 2012-09-06 01:15
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
Timestamp: 2012-09-06 01:15
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
Timestamp: 2012-09-06 01:15
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
.
Comments
- It is possible to increase the speed somewhat.
- The cost of "'HeaderLines', 1" is surprisingly high. That might say more about me than about textscan:).
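One way to avoid 'HeaderLines' when scanning a string buffer is to cut the first line off the buffer before calling textscan. The following is only a sketch: the block text is made up to mirror the sample rows above, and whether this is actually faster than 'HeaderLines' has not been measured here.

```matlab
% Made-up stand-in for one block of the string buffer (CRLF line ends).
buf = sprintf( [ 'Timestamp: 2012-09-06 01:15\r\n', ...
                 'row1 3.1416 6.2832 9.4248\r\n',   ...
                 'row1 3.1416 6.2832 9.4248\r\n' ] );

% Position of the first LF, i.e. the end of the 'Timestamp:' line.
ix_eol = find( buf == sprintf('\n'), 1, 'first' );

% Parse only the data rows; no 'HeaderLines' option is needed.
cac = textscan( buf( ix_eol+1 : end ), '%s%f%f%f', 'CollectOutput', true );
row_header = cac{1};    % cell array of row labels
data       = cac{2};    % numeric matrix, one row per data line

disp( size( data ) )
```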
.
--- Afterthought ---
The state of the file cache was not well defined when the test above was performed. Thus, I ran the test three times in a row directly after a restart of the computer. I assume the text file was not in the cache before the first run. There is no real difference between the runs.
>> tic, S = read_huge_CRLF_1(); toc
Elapsed time is 23.249289 seconds.
>> tic, S = read_huge_CRLF_1(); toc
Elapsed time is 24.461189 seconds.
>> tic, S = read_huge_CRLF_1(); toc
Elapsed time is 23.933955 seconds.
.
--- Reading blocks from cache is 20% faster. Doesn't read block_header ---
In this test textscan reads the blocks of data from the Windows file cache. In the previous test textscan read from a string buffer in the function workspace. The text file, huge_CRLF_4.txt, which is a copy of huge_CRLF.txt, was not in the cache before this test.
>> clear('S'), tic, S = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
Elapsed time is 18.926222 seconds.
>> clear('S'), tic, S = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
Elapsed time is 17.150977 seconds.
>> clear('S'), tic, S = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
Elapsed time is 17.077009 seconds.
>>
where read_huge_CRLF_3 is
function S = read_huge_CRLF_3( file_spec )
    if nargin == 0
        file_spec = 'c:\MyData\Test\huge_CRLF_Sample.txt';
    end
    n_block_header_row = 1;
    % The string buffer is only used to find the line number at which each block starts.
    str_buf = fileread( file_spec );
    ix_char_timestamp   = strfind( str_buf, 'Timestamp:' );
    ix_char_start_line  = [ 1, strfind( str_buf, char([13,10]) ) + 2 ];
    is_char_start_block = ismember( ix_char_start_line ...
                                  , ix_char_timestamp );
    ii_line_start_block = ( 1 : 1 : size( ix_char_start_line, 2 ) );
    ii_line_start_block( not( is_char_start_block ) ) = [];
    n_block = numel( ii_line_start_block );
    n_col   = 4;
    frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
    S = struct( 'BlockHeader' , cell( 1, n_block ) ...
              , 'Data'        , cell( 1, n_block ) ...
              );
    fid = fopen( file_spec, 'r' );
    % Skip the lines that precede the first block.
    if ii_line_start_block(1) >= 2
        cac = textscan( fid, '%[^\n\r]', ii_line_start_block(1)-1 );
    end
    iiBlock = 0;
    while not( feof( fid ) )
        iiBlock = iiBlock + 1;
        % Number of data rows in this block; the last block runs to the end of the file.
        if iiBlock == n_block
            n_data_row = inf;
        else
            n_data_row = ii_line_start_block( iiBlock+1 ) ...
                       - ii_line_start_block( iiBlock )   ...
                       - n_block_header_row ;
        end
        cac = textscan( fid                                ...
                      , frmt            , n_data_row         ...
                      , 'CollectOutput' , true               ...
                      , 'HeaderLines'   , n_block_header_row ...
                      );
        S(iiBlock).BlockHeader = cac{1};
        S(iiBlock).Data        = cac{2};
    end
    fclose( fid );
end
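The step that turns character positions into line numbers in read_huge_CRLF_3 can be followed on a tiny buffer. This is a sketch with made-up text; it uses find and diff instead of the index deletion in the function, which should give the same line numbers.

```matlab
% Made-up buffer: one file header line, then two blocks (CRLF line ends).
str_buf = sprintf( [ 'Header\r\n',       ...
                     'Timestamp: A\r\n', ...
                     'row1 1 2 3\r\n',   ...
                     'row1 1 2 3\r\n',   ...
                     'Timestamp: B\r\n', ...
                     'row1 1 2 3\r\n' ] );
n_block_header_row = 1;

% Character position at which each line starts (line 1 starts at 1).
ix_char_start_line = [ 1, strfind( str_buf, char([13,10]) ) + 2 ];
% Character position of each 'Timestamp:'.
ix_char_timestamp  = strfind( str_buf, 'Timestamp:' );

% Line numbers whose first character is the start of a 'Timestamp:'.
ii_line_start_block = find( ismember( ix_char_start_line, ix_char_timestamp ) );

% Data rows per block, for all blocks except the last one.
n_data_row = diff( ii_line_start_block ) - n_block_header_row;

disp( ii_line_start_block )
disp( n_data_row )
```

Here the blocks start at lines 2 and 5, so the first block has 5 - 2 - 1 = 2 data rows. The last block has no following 'Timestamp:' line, which is why the function reads it with n_data_row = inf.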
Comments
The function works as far as I can tell. However, this construct is erroneous:
fid = fopen( file_spec, 'r' );
...
cac = textscan( fid, '%[^\n\r]', ii_line_start_block(1)-1 );
Adding "t" to the permission string, i.e.
fid = fopen( file_spec, 'rt' );
does not solve the problem in my case. The EOL is CRLF, and the file pointer ends up between the CR and the LF. One solution is to add "'Delimiter', '\n'" to the argument list of textscan.
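The position of the file pointer after each variant can be inspected with ftell. The sketch below writes a small made-up CRLF file to a temporary location and prints the position after skipping one line with each call. No expected values are claimed here; the first line, 'Header' plus CRLF, is 8 bytes long, so a position of 8 would mean that the pointer is right of the EOL characters.

```matlab
% Write a tiny CRLF file (made-up content) in binary mode.
file_spec = [ tempname, '.txt' ];
fid = fopen( file_spec, 'w' );
fprintf( fid, 'Header\r\nTimestamp: 2012-09-06 01:15\r\nrow1 3.1416 6.2832 9.4248\r\n' );
fclose( fid );

% Variant 1: the construct discussed above.
fid = fopen( file_spec, 'r' );
textscan( fid, '%[^\n\r]', 1 );
pos_plain = ftell( fid );
fclose( fid );

% Variant 2: with 'Delimiter', '\n'.
fid = fopen( file_spec, 'r' );
textscan( fid, '%s', 1, 'Delimiter', '\n' );
pos_delim = ftell( fid );
fclose( fid );

delete( file_spec );

fprintf( 'position without Delimiter: %d\n', pos_plain );
fprintf( 'position with    Delimiter: %d\n', pos_delim );
```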
.
--- My final function to read the file ---
Further refactored. The text file, huge_CRLF_5.txt, which is a copy of huge_CRLF.txt, was not in the cache before this test.
clear('S'), tic, S = read_huge_CRLF_5('c:\...\huge_CRLF_5.txt'); toc
Elapsed time is 19.284555 seconds.
clear('S'), tic, S = read_huge_CRLF_5('c:\...\huge_CRLF_5.txt'); toc
Elapsed time is 17.210736 seconds.
where read_huge_CRLF_5 is
function S = read_huge_CRLF_5( file_spec )
    if nargin == 0
        file_spec = 'c:\MyData\Test\huge_CRLF_Sample.txt';
    end
    n_block_header_row = 1;
    % The string buffer is only used to find the line number at which each block starts.
    str_buf = fileread( file_spec );
    ix_char_timestamp   = strfind( str_buf, 'Timestamp:' );
    ix_char_start_line  = [ 1, strfind( str_buf, char([13,10]) ) + 2 ];
    is_char_start_block = ismember( ix_char_start_line ...
                                  , ix_char_timestamp );
    clear('str_buf')
    ii_line_start_block = ( 1 : 1 : size( ix_char_start_line, 2 ) );
    ii_line_start_block( not( is_char_start_block ) ) = [];
    n_block = numel( ii_line_start_block );
    n_col   = 4;
    frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
    S = struct( 'BlockHeader' , cell( 1, n_block ) ...
              , 'Data'        , cell( 1, n_block ) ...
              , 'Time'        , cell( 1, n_block ) ...
              );
    fid = fopen( file_spec, 'r' );
    cup = onCleanup( @() fclose(fid) );    % closes the file even if an error occurs
    % Skip the lines that precede the first block.
    if ii_line_start_block(1) >= 2
        textscan( fid, '%s', ii_line_start_block(1)-1 ...
                , 'Delimiter', '\n' );
    end
    iiBlock = 0;
    while not( feof( fid ) )
        iiBlock = iiBlock + 1;
        % Number of data rows in this block; the last block runs to the end of the file.
        if iiBlock == n_block
            n_data_row = inf;
        else
            n_data_row = ii_line_start_block( iiBlock+1 ) ...
                       - ii_line_start_block( iiBlock )   ...
                       - n_block_header_row ;
        end
        % Read the 'Timestamp:' line, then the data rows of the block.
        S(iiBlock).Time = sscanf( fgetl(fid), 'Timestamp:%s' );
        cac = textscan( fid                        ...
                      , frmt            , n_data_row ...
                      , 'CollectOutput' , true       ...
                      );
        S(iiBlock).BlockHeader = cac{1};
        S(iiBlock).Data        = cac{2};
    end
end
Comments
- clear('str_buf') frees memory. It is obviously faster to read from the file cache than from the string buffer.
- "'Delimiter', '\n'" places the "file position indicator" right of the EOL characters, whether they are CRLF or LF and whether or not the "t" is added to the "permission string".
- fgetl(fid) always places the "file position indicator" right of the EOL characters.
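The Time field holds only the date, because '%s' stops at the first white space. If the time of day is wanted as well, the rest of the line can be kept instead. A minimal sketch, in which line_str is a stand-in for the value returned by fgetl(fid):

```matlab
% Stand-in for the line returned by fgetl(fid); text taken from the sample file.
line_str = 'Timestamp: 2012-09-06 01:15';

% As in read_huge_CRLF_1: '%s' reads up to the first white space, i.e. the date only.
date_only = sscanf( line_str, 'Timestamp:%s', 1 );

% Alternative: keep everything after the 'Timestamp:' label.
full_stamp = strtrim( line_str( numel( 'Timestamp:' )+1 : end ) );

disp( date_only )
disp( full_stamp )
```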