MATLAB: Readtable for a huge (no good schema) CSV file!

csvdelimiter semi-colonhuge filereadtable

Hi All,

I have a tricky task of reading this huge csv file (1360×107037) . The example data is attached with this question and it just contains only few rows and columns. I tried importing it through the matlab function readtable.

What I did –

I tried to get the opts throught detectImportOptions()
As the delimiter is 'semi-colon', I assigned it to opts.Delimeter
Then I used readtable(filename, opts)
I got a table with 1360 rows and 107037 columns.

Problem –

I am getting in every even row for cells that supposed to be a string or datestring NaN values but all numerical values are retained
I tried to make the 7th line of the csv data as VariableNames but somehow I am getting Var1, Var2, …… etc are the VariableNames

Does anyone know how could I get rid of these NaN values and obtain the actual string if exists or just an empty string and make the 7th line of the CSV file as the VariableNames.

Update 1

I guess it is expecting a number at that point but instead it is receiving a String!

Update 2

I did this –

opts.VariableTypes(1, 1:end) = {'char'};

Now I am getting all the values but now all the numbers are casted to 'char'!

I dropped all th eampty columns;

table (:, all(ismissing(table))) =[];

I thought of saving the Table as .mat for later use but (my stupidity) it contains now all 'char' values and the file is really huge!

Any help is appreciated!

Thanks & Cheers,

Savan

Best Answer

As has been suggested you will have to parse the header and the values separately. It can't be done with the same readtable call as the data in your header is organised by rows whereas the values are organised by columns. In fact, the header is completely not suitable for readtable, so if it is indeed needed, reading the file line by line and using textscan is probably the best route:

fid = fopen('Example_CSV.csv', 'rt');  %open as text mode for automatic line ending conversion
%since textscan does not support timezones read everything in the first 7 rows as text
headers = cell(7, 1);
for l = 1:7
    tline = fgetl(fid);
    headers(l) = textscan(tline, '%s', 'Delimiter', ';', 'MultipleDelimAsOne', true);  %not sure you want that last option for your actual file
end
fclose(fid);
%conversion to datetime
for l = 2:4
    headers{l} = datetime(headers{l}, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ssZ', 'TimeZone', 'UTC+2');  %Choose whichever timezone you want for display
end
headers(5:6) = [];  %discard unwanted row 5 and 6

As for the data itself, with your example file, the following requires no extra parsing afterwards but may need adapting for your file with more columns:

opts = detectImportOptions('Example_CSV.csv', 'NumHeaderLines', 18, 'Delimiter', ';'); %Basic detection. If on R2019a or later, add 'ReadVariableNames', false
%correction for the date columns:
opts = opts.setvartype(1:2, 'datetime');  %proper type
opts = opts.setvaropts(1:2, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ssZ', 'TimeZone', 'UTC+2');  %proper decoding. Note: Set the timezone to whatever timezone you want it to display in
%remove unneeded variables
opts.SelectedVariableNames = opts.VariableNames([1:3, 6]);
opts.ExtraColumnsRule = 'ignore';
%read the data
data = readtable('Example_CSV.csv', opts);

Related Solutions

MATLAB: How to specify the column headers while reading in a CSV file using ‘readtable’

The 'VariableNamesLine' property can be used to specify this. Please refer to the example snippet below.

%Import the options of the csv file
opts=detectImportOptions('sample.csv');
%Defines the row location of channel variable name
opts.VariableNamesLine = 4;
%Specifies that the data is comma seperated

opts.Delimiter =','; %Specifies that the data is comma seperated
%Read the table
t = readtable('sample.csv',opts, 'ReadVariableNames', true);

As mentioned in the link below, the properties 'VariableNamesRange' and 'VariableNamesLine' can be specified only if the 'ReadVariableNames' argument is set to 'true'. Refer to the second subsection of 'Name-Value Pair Arguments' to learn more.

https://www.mathworks.com/help/releases/R2019a/matlab/ref/readtable.html

MATLAB: How to read and create column vectors from this txt file (attached)

A fixedWidthImportOptions object can help here altho the heading row doesn't quite line up to make it as trivial as could otherwise be...

opt=fixedWidthImportOptions('NumVariables',10);                 % create base import object w 10 variables
opt.VariableWidths=[8 10*ones(1,9)];                            % set field width to match file
opt.DataLines=[4 inf];                                          % data starts row 4
tout=readtable('output_nostri.txt',opt);                        % import data; will be cellstr arrays
for i=2:10                                                      % convert the numeric data
  tout.(tout.Properties.VariableNames{i})=str2double(strrep(tout{:,i},',','.'));
end
tout.Var1=datetime(tout.Var1,'InputFormat','MMM-yyyy');         % and the time

I presume local settings will handle the month name abbreviations; the above fails on all that aren't english translations here; I would presume that comes from a locale setting for other places but I've never had the opportunity to test to know if that is so or not.

Also leaves variable names Var1 thru Var10; the column headings in line two don't exactly align with the data columns so to read it would have to do a second read of that line...

Best Answer

Related Solutions

MATLAB: How to specify the column headers while reading in a CSV file using ‘readtable’

MATLAB: How to read and create column vectors from this txt file (attached)

Related Question