I am working with a 22 GB text file of around 180 million rows and 17 columns. It is transportation data that includes "Trip" and "Speed" columns. I want to calculate the average speed of each trip.
clear; clc;
data = 'D:\Umich\eBlueBus_2019 Summer\UMTRI data\SQLDataExport.txt';
% data = 'D:\Umich\eBlueBus_2019 Summer\Simulation\Trip_11to12_0702.txt';
ds = tabularTextDatastore(data);
ds.ReadSize = 200000;
ds.SelectedVariableNames = {'Trip','GpsSpeed'};
ds.SelectedFormats = {'%f','%f'};
t_array = tall(ds);
A = matlab.tall.transform(@reduce_fcn, t_array, 'OutputsLike', {(table(1,1,'VariableNames',{'Trip','Spd'}))});
where @reduce_fcn is the function below:
function TT = reduce_fcn(t_array)
    [groups, Y] = findgroups(t_array.Trip);
    D = splitapply(@mean, t_array.GpsSpeed, groups);
    TT = table(Y, D, 'VariableNames', {'Trip','Spd'});
end
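For reference, the grouping logic inside reduce_fcn can be checked on a small in-memory table. This is only a sketch: the sample values below are made up for illustration and do not come from the real file.

```matlab
% Made-up sample data (not from SQLDataExport.txt): two trips, three samples each
sample = table([11;11;11;12;12;12], [1;2;3;4;6;8], ...
    'VariableNames', {'Trip','GpsSpeed'});

% Same steps as the body of reduce_fcn
[groups, Y] = findgroups(sample.Trip);            % one group index per distinct trip
D = splitapply(@mean, sample.GpsSpeed, groups);   % mean speed within each group
TT = table(Y, D, 'VariableNames', {'Trip','Spd'});
disp(TT)
```

With these sample values, TT has one row per trip (11 and 12) with mean speeds 2 and 6.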
After I ran the code and called B=gather(A), I got an error:
>> B=gather(A);
Evaluating tall expression using the Local MATLAB Session:
– Pass 1 of 1: 12% complete
Evaluation 12% complete
Error using matlab.io.datastore.TabularTextDatastore/readData (line 77)
Mismatch between file and format character vector.
Trouble reading 'Numeric' field from file (row number 58461, field number 2) ==>
Trip,Time,GpsTime,GpsWeek,GpsHeading,GpsSpeed,Latitude,Longitude,Altitude,NumberOfSats,Differential,FixMode,Pdop,GpsBytes,UtcTime,UtcWeek\n
Learn more about errors encountered during GATHER.
Error in matlab.io.datastore.TabularDatastore/read (line 120)
[t, info] = readData(ds);
Error in tall/gather (line 50)
[varargout{:}, readFailureSummary] = iGather(varargin{:});
Caused by:
Reading the variable name 'Trip' using format '%f' from file: 'D:\Umich\eBlueBus_2019 Summer\UMTRI data\SQLDataExport.txt' starting at offset 2852126773.
I checked row 58461 of the data: Trip and GpsSpeed are in the second column (value 10) and the seventh column (value 1.4299999). They look fine to me, so I don't understand why reading t_array.Trip with %f fails, as the error indicates.
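One way to see the text the reader actually hit is to look at the file starting from the byte offset in the error message, instead of counting rows from the top of the file. The sketch below uses a hypothetical helper, peek_rows (not a MATLAB built-in), and assumes that the row number in the error is counted from that offset rather than from the start of the file; that assumption may not hold.

```matlab
function rows = peek_rows(file, offset, firstRow, lastRow)
% PEEK_ROWS  Hypothetical helper: return lines firstRow..lastRow of FILE,
% counted from the byte position OFFSET (line 1 starts at OFFSET).
fid = fopen(file, 'r');
assert(fid ~= -1, 'Could not open %s', file);
cleanup = onCleanup(@() fclose(fid));    % close the file even if an error occurs
status = fseek(fid, offset, 'bof');      % jump to the offset from the beginning of the file
assert(status == 0, 'Could not seek to offset %d', offset);
rows = {};
for k = 1:lastRow
    txt = fgetl(fid);                    % read one line without its newline
    if ~ischar(txt), break; end          % fgetl returns -1 at end of file
    if k >= firstRow
        rows{end+1, 1} = txt;            %#ok<AGROW> keep only the requested rows
    end
end
end
```

For the error above, a call such as peek_rows(data, 2852126773, 58459, 58463) would return the few lines around the reported row, which can then be compared with the header text quoted in the error message.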
Besides, when I ran a test file containing only a portion of the original data (Trips 11 and 12 only; this is the commented-out line in the code), I had no problems and got the answer I expected: a 2-by-2 table whose first column holds the trips (11 and 12) and whose second column holds their average speeds.
I would appreciate any suggestions. Thank you.
Best Answer