MATLAB: Data extraction from .txt file

extractionnotepad

I have a LARGE minute data set in the format of:
10/21/2002,0601,0.97360,0.97360,0.97360,0.97360,0,0
10/21/2002,0602,0.97360,0.97360,0.97360,0.97360,0,0
10/21/2002,0603,0.97350,0.97350,0.97340,0.97350,0,0
10/21/2002,0604,0.97340,0.97340,0.97340,0.97340,0,0
10/21/2002,0605,0.97330,0.97330,0.97330,0.97330,0,0
10/21/2002,0606,0.97300,0.97310,0.97300,0.97310,0,0
10/21/2002,0607,0.97290,0.97290,0.97290,0.97290,0,0
10/21/2002,0608,0.97280,0.97280,0.97260,0.97260,0,0
10/21/2002,0609,0.97270,0.97270,0.97260,0.97260,0,0
This goes on till 2012.
I need two programs:
  1. Select a time period for which I need. So lets say 0600-0630, then the program would store IN ORDER all the 0600-0630 in one matrix in matlab.
  2. Select selected dates period and time period. So lets say I want 10/21/2002 0500 – 0700 and 17/21/2002 – 0500 – 0700.
Basically I want programs which will let me extract any time period and date period from the data. It is huge data so remember the program cant run for too long.
Look forward to some replies. Thanks in advance.

Best Answer

Check whether this approach is fast enough. SSCANF used to be faster than DATEVEC to convert date strings with identical format for all lines and only numbers. Might not be the case anymore.
I copied your data to cssm.dat.
fid = fopen( '.\cssm.dat', 'r' );
cac = textscan( fid, '%s%s%f%f%f%f%f%f', 'Delimiter', ',' );
sts = fclose( fid );
vec = zeros( numel( cac{1,1} ), 6 );
[ vec(:,1), vec(:,2), vec(:,3), ~, ~, ~ ] = datevec( cac{1,1}, 'mm/dd/yyyy' );
[ ~, ~, ~, vec(:,4), vec(:,5), ~ ] = datevec( cac{1,2}, 'HHMM' );
dat = [cac{1,3:end}];
m1 = 4;
m2 = 7;
d06 = dat( vec(:,4)==6 & vec(:,5)>=m1 & vec(:,5)<=m2, : );
v06 = vec( vec(:,4)==6 & vec(:,5)>=m1 & vec(:,5)<=m2, : );
Use the same approach for your second case.
Rounding errors should not be a problem in this example since DATEVEC returns flint (floating integers)
FSCANF (with 64bit R2011a) seems to be twice(?) as fast
fid = fopen( '.\cssm.dat', 'r' );
num = fscanf( fid, '%2u/%2u/%4u,%2u%2u,%f,%f,%f,%f,%f,%f', [11,inf] );
sts = fclose( fid );
num = transpose( num );
vec(:,1:5) = num(:,[3,2,1,4,5]);
vec(:,6) = 0;
dat = num(:,6:end);
However, this requires a bit more care and a file that adheres to the format.
The code above assumes that all data fits in memory.
Surprisingly, the code below is another 20% faster (with 64bit R2011a). I see three advantages with this code. Easy to handle header lines. Easier do debug when the text file is "corrupted". One additional line of code will handle comma as decimal separator. And it's a bit faster.
fid = fopen( '.\cssm.dat', 'r' );
cac = textscan( fid, '%s', 'Whitespace','', 'Delimiter','\n' );
sts = fclose( fid );
str = transpose( char( cac{:} ) );
str = cat( 1, str, repmat( ',', 1, size(str,2) ) );
num = sscanf( str, '%2u/%2u/%4u,%2u%2u,%f,%f,%f,%f,%f,%f,', [11,inf] );
vec(:,1:5) = num(:,[3,2,1,4,5]);
vec(:,6) = 0;
dat = num(:,6:end);