MATLAB: Using textscan with mixed data type in a single field/array

textscan mixed data textfile read

Hello,
I am having trouble reading a large (~30,000 rows) text file into Matlab. The data looks something like this:
BLOCK
1) 1996/01/01 00:00:00 -99.000N -99.000N
2) 1996/01/01 00:15:00 -99.000N -99.000N
3) 1996/01/01 00:30:00 -99.000N -99.000N
4) 1996/01/01 00:45:00 -99.000N -99.000N
5) 1996/01/01 01:00:00 -99.000N -99.000N
  • skipped rows
16455) 1996/06/20 09:30:00 -99.000N -99.000N
16456) 1996/06/20 09:45:00 -99.000N -99.000N
16457) 1996/06/20 10:00:00 -99.000N -99.000N
16458) 1996/06/20 10:15:00 1.869T 0.088T
16459) 1996/06/20 10:30:00 1.892 0.083
16460) 1996/06/20 10:45:00 1.913 -0.082
16461) 1996/06/20 11:00:00 1.913 -0.064
16462) 1996/06/20 11:15:00 1.895 0.035
I use textscan to read in the data like this:
textFilename = [year,SID,'.txt'];
fid = fopen(textFilename, 'rt');
C = textscan(fid, '%*s%d/%d/%d%d%c%d%c%d%f%c%f%c','Headerlines',11);
The problem (as you can see from the data) is that some of the values in the last two columns have a letter attached. Because this doesn't apply to all rows, when I read this letter as a character (%c), textscan moves on in rows where it is absent and reads the '-' sign of the next number instead. Thus, the values from the fourth column are incorrectly read as positive when they are actually negative.
My question is: how can I tell textscan to read the values from the last two columns while somehow separating the letters…
Any and all help greatly appreciated!
Ozgun

Best Answer

If you don't need N and T, the simplest approach is probably to eliminate them before the call to TEXTSCAN:
content = fileread( 'myFile.txt' ) ;
isNT = content == 'N' | content == 'T' ;
content(isNT) = ' ' ; % Replace with white space.
then you can TEXTSCAN type-homogeneous columns:
C = textscan( content, ... ) ;
Note that the content variable is the first argument: TEXTSCAN accepts either a file identifier or a character vector. If you need N and T, we can talk about the post-processing mentioned in my comment above (no time now, but I'll come back later tonight).
If you want to process the whole file in one pass using REGEXP, here is an example, but keep in mind that REGEXP is overkill for this operation and will take more time than a basic TEXTSCAN.
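As a rough sketch of how the elided TEXTSCAN call could be completed once the letters are removed: the format string below is an assumption (it is not from the original answer), and three rows copied from the question stand in for the file contents. The variable names rowId, dateVec, and values are illustrative only.

```matlab
% Three sample rows stand in for the output of fileread (data copied from
% the question; the real file also has header lines).
content = sprintf('%s\n', ...
    '16457) 1996/06/20 10:00:00 -99.000N -99.000N', ...
    '16458) 1996/06/20 10:15:00 1.869T 0.088T', ...
    '16460) 1996/06/20 10:45:00 1.913 -0.082');

% Blank out the flag letters so the last two columns are purely numeric.
content(content == 'N' | content == 'T') = ' ';

% Assumed format: the literal ')', '/' and ':' are matched and discarded.
% For the real file, the 'HeaderLines' option from the question would
% still be needed.
C = textscan(content, '%f) %f/%f/%f %f:%f:%f %f %f');

rowId   = C{1};       % Row ID.
dateVec = [C{2:7}];   % N-by-6: year, month, day, hour, minute, second.
values  = [C{8:9}];   % N-by-2: 1st and 2nd coord, signs preserved.
```

Because every column is now numeric, no %c is needed and the minus signs stay attached to their values.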
content = fileread( 'myFile.txt' ) ;
% Build cell array of entries.
pattern = '([\d]+)\)\s+([\d\s:/]{19})\s+([\d\-.]+)([NT]?)\s+([\d\-.]+)([NT]?)' ;
tokens = regexp( content, pattern, 'tokens' ) ;
tokens = reshape( [tokens{:}], numel( tokens{1} ), [] ).' ;
% Convert columns into numeric, string, and time data.
numData = str2double( tokens(:,[1,3,5]) ) ; % Row ID, 1st coord, 2nd coord.
strData = tokens(:, [4,6]) ; % 1st and 2nd N, T, or empty.
timData = datevec(tokens(:,2), 'yyyy/mm/dd HH:MM:SS' ) ;
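To see what the REGEXP approach produces, here is a small self-contained run on three rows copied from the question, used in place of the fileread call. The mask isFlagN is a hypothetical addition showing one way the N letters could be used afterwards; it is not part of the original answer.

```matlab
% Sample rows from the question stand in for the file contents.
content = sprintf('%s\n', ...
    '16457) 1996/06/20 10:00:00 -99.000N -99.000N', ...
    '16458) 1996/06/20 10:15:00 1.869T 0.088T', ...
    '16460) 1996/06/20 10:45:00 1.913 -0.082');

% Same pattern and reshaping as in the answer: one row per match,
% six tokens per row.
pattern = '([\d]+)\)\s+([\d\s:/]{19})\s+([\d\-.]+)([NT]?)\s+([\d\-.]+)([NT]?)';
tokens  = regexp(content, pattern, 'tokens');
tokens  = reshape([tokens{:}], numel(tokens{1}), []).';

numData = str2double(tokens(:, [1,3,5]));   % Row ID, 1st coord, 2nd coord.
strData = tokens(:, [4,6]);                 % 'N', 'T', or empty.
timData = datevec(tokens(:,2), 'yyyy/mm/dd HH:MM:SS');

% Hypothetical follow-up: logical mask of rows whose 1st coord carries 'N'.
isFlagN = strcmp(strData(:,1), 'N');
```

Rows without a letter yield an empty character array in strData, so strcmp simply returns false for them.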