MATLAB: How to read rows of data in .csv file and turn them into arrays

blockcsvperformancespeedsscanftextscan

I have some data in a csv file that consists of some "delays" and for each delay there is a row of "wavelengths" and a row of corresponsing "intensities". It looks like this in excel:

I want the code to take all the delays and make them into one vector. And take the wavelengths (all the wavelength rows are identical) and make them into one vector. And take the intensities and make an array of intensities.

I tried textscan but that seems column oriented. I need matlab to read line by line (row by row), extract the data and put it into relevant array and then repeat procedure for every delay.

Best Answer

Notes to consider when importing a large file:

Importing as character and then doing any kind of text manipulation before converting to numeric will be very inefficient for a large file.
Importing the file in blocks is only efficient for a relatively small number of very large data blocks. This is not the case with your data: your file has many small blocks, forcing you to access the file repeatedly and expand the output array many times.
Regardless of what method you use to import it, the entire imported file data must fit into MATLAB memory. If this is not the case then you will need to use tall arrays or some other approach to processing your data.
Regardless of what method you use to import it, importing data will take some time.
If you want to import a large file efficiently then you need to minimize any handling of the imported file data, the optimal solution is to get all of the importing and conversion to numeric performed in one call to one reasonably low-level MATLAB command (i.e. avoid readtable , readmatrix, et al, which are just complex wrappers around textscan).

Method one: textscan. With suitable options to read that file format (note how ':' is defined as the EOL character, so each data block become one line of the imported matrix), you only need to call textscan once and it will import the entire file:

[fid,msg] = fopen('20200107LaAlO3.csv','rt');
assert(fid>=3,msg)
%% Detect the number of columns:
fgetl(fid); % ignore first line.
nmc = 1+nnz(fgetl(fid)==','); % number of columns.
fmt = ['%f%f',repmat('%f',1,2*nmc),'%*s'];
%% Import the file data:
opt = {'HeaderLines',1, 'EndOfLine',':', 'MultipleDelimsAsOne',true,...
    'Whitespace',', \n\r\t', 'CollectOutput',true}; % COLLECTOUTPUT is optional
frewind(fid)
C = textscan(fid,fmt,opt{:});
fclose(fid);
%% Optional: split into separate arrays:
D = C{1}(:,1:2); % Delays

W = C{1}(:,3:2+nmc); % Wavelengths
I = C{1}(:,3+nmc:end); % Intensities

And checking the output:

>> D % Delays
D =
          -20      -19.999
      -19.985      -19.982
       -19.97      -19.967
      -19.955      -19.952
       -19.94      -19.937
      -19.925      -19.922
       -19.91      -19.907
      -19.895      -19.892
       -19.88      -19.877
      -19.865      -19.862
>> W(:,[1:5,end-5:end]) % Wavelengths (first and last few columns only)
ans =
       189.25       189.71       190.18       190.65       191.12       1098.3       1098.7       1099.1       1099.5       1099.9       1100.4
       189.25       189.71       190.18       190.65       191.12       1098.3       1098.7       1099.1       1099.5       1099.9       1100.4
       189.25       189.71       190.18       190.65       191.12       1098.3       1098.7       1099.1       1099.5       1099.9       1100.4
       189.25       189.71       190.18       190.65       191.12       1098.3       1098.7       1099.1       1099.5       1099.9       1100.4
       189.25       189.71       190.18       190.65       191.12       1098.3       1098.7       1099.1       1099.5       1099.9       1100.4
       189.25       189.71       190.18       190.65       191.12       1098.3       1098.7       1099.1       1099.5       1099.9       1100.4
       189.25       189.71       190.18       190.65       191.12       1098.3       1098.7       1099.1       1099.5       1099.9       1100.4
       189.25       189.71       190.18       190.65       191.12       1098.3       1098.7       1099.1       1099.5       1099.9       1100.4
       189.25       189.71       190.18       190.65       191.12       1098.3       1098.7       1099.1       1099.5       1099.9       1100.4
       189.25       189.71       190.18       190.65       191.12       1098.3       1098.7       1099.1       1099.5       1099.9       1100.4
>> I(:,[1:5,end-5:end]) % Intensities (first and last few columns only)
ans =
        3802        2566        2801        2749        2918        2938        2931        2756        2760        2758        2756
        3805        2570        2806        2748        2922        2944        2917        2754        2755        2755        2754
        3806        2569        2800        2751        2925        2960        2918        2757        2760        2758        2756
        3805        2564        2802        2747        2924        2946        2915        2754        2757        2753        2757
        3804        2568        2801        2744        2928        2937        2914        2754        2759        2758        2760
        3805        2568        2807        2748        2935        2939        2928        2752        2762        2756        2756
        3802        2569        2802        2752        2933        2936        2910        2755        2758        2755        2757
        3803        2568        2804        2747        2927        2953        2920        2754        2761        2752        2758
        3803        2571        2802        2746        2923        2936        2927        2752        2759        2757        2762
        3810        2567        2807        2749        2931        2945        2918        2757        2755        2758        2760
>>

It might be more efficient to not use the 'CollectOutput' option, and concatenate the vectors afterwards, I leave that up to you to experiment with (hint: use comma-separated lists or cell2mat).

Method two: sscanf. This is by far the fastest method that I found to import that file data, helped by the fact that sscanf is as low-level as it gets. Note that the imported character data is not manipulated, split, or altered in any way before being converted by sscanf:

str = fileread('20200107LaAlO3.csv');
nmb = nnz(str==':'); % number of blocks
nmc = 1+(nnz(str==',')/nmb-1)/2; % number of columns
fmt = repmat(',%f',1,nmc);
fmt = ['%*[\n\r d]elay:%f,%f',fmt(2:end),fmt(2:end)];
mat = sscanf(str,fmt,[2+2*nmc,nmb]).';

The output matrix is exactly identical to the output of Method one:

>> isequal(mat,C{1})
ans =
     1

But it is faster:

Elapsed time is 0.599085 seconds. % Method one
Elapsed time is 0.111845 seconds. % Method two

Caveat: As per isakson noted at the end of their answer, importing this file data will lose information as the file contains more digits precision than can be stored in a Double floating point number (but as this is only at the 18th/19th digit this is very unlikely to be a problem).

Best Answer

Related Solutions

MATLAB: How to use fprintf to show collated values for two arrays.

MATLAB: Read two lines from a file and repeat for next lines

Related Question