MATLAB: How to estimate the time required by textscan and the size of the output


Hello,
I am running Matlab 2013b on Windows 7. I have 8 GB of RAM and I set the swap file to 20 GB.
I am trying to read a relatively large txt file that is tab separated. The size of the file on the hard disk is a little over 2 GB. There are 6 columns and approx. 64 million rows in the file. The entries are mixed (strings and numbers with some missing values).
At this point I am using:
textscan(fid,repmat('%s',1,6),'delimiter','\t');
It has been running for about 4 hours now and is using about 6.5 GB of RAM.
1. I would like to know how I can estimate the time it takes to read the file and the size of the output.
2. After it is done I would like to extract the numerical values from the resulting cell matrix and save that to a .mat file. Any idea how long that would take?
3. Is there any better way of doing this? If I could extract from the file a matrix with the numerical values only (setting everything else to NaN) it would be great.
Thanks!

Best Answer

I ran an experiment with R2013a (64-bit) on Win7, with 8GB RAM, a 9GB page file and a mechanical HD. I
  • created a file with "6 columns and approx. 64 million rows"
  • read a piece of the file with textscan after a restart of Matlab
  • monitored the memory usage with the Windows Task Manager
Elapsed time is 192.597879 seconds.
>> cac{1}(1:3)
ans =
'Col1'
'Col1'
'Col1'
>> cac{2}(1:3)
ans =
1
1
1
>> cac{6}(1:3)
ans =
3
3
3
>> whos cac
  Name      Size             Bytes  Class    Attributes
  cac       1x6         1920000672  cell
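The byte count reported by whos can be reproduced by hand, which gives a rough way to estimate the size of the output for the whole file. This is only a sketch: the 112 bytes of overhead per cell element is an assumption (it happens to reproduce the figure above exactly), and the string length of 4 characters applies to this test file, not necessarily to the real data.

```matlab
% Cross-check the whos figure for 5e6 rows, then extrapolate to 64e6 rows.
cellOverhead = 112;   % bytes per cell element (assumed; matches the whos output)
bytesPerChar = 2;     % Matlab stores char as 16-bit values
bytesPerDbl  = 8;     % one double
nStrCols     = 3;     % columns read with %s
nNumCols     = 3;     % columns read with %f
strLen       = 4;     % 'Col1', 'Col2', 'Col3' in the test file

% Every string is its own cell element, so it pays the overhead each time.
bytesPerRow = nStrCols*(cellOverhead + strLen*bytesPerChar) + nNumCols*bytesPerDbl;
outerCell   = (nStrCols + nNumCols)*cellOverhead;   % the 1x6 cell itself

bytes5e6    = 5e6 *bytesPerRow + outerCell;
bytes64e6   = 64e6*bytesPerRow + outerCell;
numericOnly = 64e6*nNumCols*bytesPerDbl;            % numeric columns alone

fprintf('5e6 rows, all columns     : %d bytes\n', bytes5e6);
fprintf('64e6 rows, all columns    : %.1f GB\n', bytes64e6/1e9);
fprintf('64e6 rows, numeric columns: %.1f GB\n', numericOnly/1e9);
```

Under these assumptions the strings account for nearly all of the memory: 360 of the 384 bytes per row. Reading all six columns with %s, as in the question, would cost even more, since the numbers would then be stored as strings too.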
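where the code is
% Write the test file: 64e6 rows of the form Col1<tab>1.0<tab>Col2<tab>2.0<tab>Col3<tab>3.0
fid = fopen('c:\tmp\test.txt', 'w');
M = cumsum(ones( 3, 64e6 ), 1 );   % each column of M is [1;2;3] and becomes one row of the file
fprintf( fid, 'Col1\t%4.1f\tCol2\t%4.1f\tCol3\t%4.1f\n', M );
fclose( fid );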
tic
fid = fopen('c:\tmp\test.txt');
cac = textscan( fid, '%s%f%s%f%s%f', 5e6, 'Delimiter', '\t' );
fclose( fid );
toc
The procedure was:
  1. start Matlab
  2. run the experiment
Results
  • Reading and parsing 5 million rows took a little over three minutes (192.6 s) and peaked at 4.8GB RAM usage
  • 5 million rows produced a 1.9GB variable, cac, in Matlab.
  • An experiment to read the entire file showed that the speed dropped drastically once there was no free physical RAM left. (I killed the process.) With 8GB RAM, nearly ten million rows can be read efficiently.
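One way to act on these results is to read the file in blocks and keep only the numeric columns, so that the strings never accumulate in memory. The following is a minimal sketch rather than a tested solution for the real file. It runs on a small generated file with the same layout as the experiment; the file names and the block size are made up for the demo. It also assumes that the string and number columns alternate as in the test file and that the numeric fields contain only numbers or are empty (empty fields come back as NaN by default, but textscan stops at a field it cannot convert).

```matlab
% Small demo file with the same layout as the experiment (hypothetical name).
fname = fullfile(tempdir, 'textscan_demo.txt');
fid = fopen(fname, 'w');
fprintf(fid, 'Col1\t%4.1f\tCol2\t%4.1f\tCol3\t%4.1f\n', cumsum(ones(3, 10), 1));
fclose(fid);

blockRows = 4;                     % rows per textscan call; try about 1e6 on real data
fmt = '%*s%f%*s%f%*s%f';           % %*s skips a string column, %f keeps a number

fid = fopen(fname, 'r');
blocks = {};                       % one numeric matrix per block
nRows  = 0;
t0 = tic;
while ~feof(fid)
    c = textscan(fid, fmt, blockRows, 'Delimiter', '\t');
    n = min(cellfun(@numel, c));   % number of complete rows in this block
    if n == 0, break; end          % nothing parsed: stop instead of looping forever
    blocks{end+1} = [c{1}(1:n), c{2}(1:n), c{3}(1:n)];
    nRows = nRows + n;
    fprintf('%d rows read, %.3g rows/s\n', nRows, nRows/toc(t0));
end
fclose(fid);

M = vertcat(blocks{:});            % nRows-by-3 matrix of doubles
save(fullfile(tempdir, 'textscan_demo.mat'), 'M');
```

The printed rows/s figure gives a running estimate of the total time: divide the expected number of rows by it. For comparison, the rate measured above (5 million rows in 192.6 s) would put 64 million rows at about 64e6/5e6 × 192.6 s ≈ 2465 s, or roughly 41 minutes, but only if the rate held, which it did not once physical RAM ran out. Note that vertcat briefly needs memory for both the blocks and the combined matrix, and that save needs the '-v7.3' option for variables larger than 2GB.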