MATLAB: Textscan failing to read data in text file


I have a text file with a fileID called fidRawData that contains rows that look like this:
A BCD 99.9 9.90 9.999 99.9 0.999 0.99 9.999 9.999 99.99 99.9 9.9
A can be one of two characters ('A' or 'B'), or it can be empty (a space is inserted in its place, leaving white space at the beginning of the row). The status of this first character can vary by row. BCD is a three-letter code that can vary depending on the row. The subsequent columns of numbers I want to consider as being as general as possible, but none of them will ever get large. They should all be between -9999 and 9999.
Sometimes an error occurs and '---' is inserted in place of some of the numbers in a given row, like this:
A BCD 99.9 9.90 9.999 --- --- 0.99 9.999 9.999 99.99 99.9 9.9
The only thing I can really be sure of is that there will always be at least one space between the columns; there may be more than one. The numbers can vary in sign, in where the decimal point is, and in how large or small they are.
I need to use either textscan or fscanf (I would prefer textscan for its greater flexibility) to store the data from every column, including the text in the first two columns, in a data type that accepts this mix of simpler types and lets me retrieve the data easily.
Whenever the 'A' is omitted and a ' ' is put in its place, I am OK with an 'N' or other character taking its place if need be, but if there is an 'A' or a 'B', I want that stored as 'A' or 'B' respectively.
When a '---' shows up, I want to replace it with NaN, an empty location in the data structure, or some other indication that no data is available.
I tried the following command on a single row that had an 'A' at the beginning and no '---' anywhere in the row:
rawData = textscan(fidRawData, '%s %s %f %f %f %f %f %f %f %f %f %f')
This command worked as expected. It returned a 1×14 cell array where all the values in the text file were stored as I wanted in rawData.
But there are plenty of rows without an 'A' or 'B', and rows where '---' is present at least once. To try to address these variations, I tried the following on a row where both conditions are true:
rawData = textscan(fidRawData, '%s %s %f %f %f %f %f %f %f %f %f %f %f %f','Delimiter',' ','EmptyValue',0)
This test results in a 1×14 cell array that is completely empty. Each cell is either a 1×1 cell containing a 0×0 char array or a 0×1 double.
rawData = textscan(fidRawData, '%s %s %f %f %f %f %f %f %f %f %f %f')
worked up until it hit the '---' in the row, then began returning 0×1 double cells for the remaining columns of rawData.
What can I do to get textscan to deal with these possibilities?

Best Answer

Here is one way. We pre-process the content before parsing, adding 'N' where the first letter is missing. Then we count the number of columns, split the content on whitespace, and reshape the output according to the number of columns. Finally we extract the header (the first two char columns) and convert the rest to double; str2double returns NaN for any token it cannot convert, such as '---'.
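content = fileread( 'data.txt' ) ;                                              % whole file as one char vector
content = regexprep( content, '^\s', 'N ', 'lineanchors' ) ;                     % missing first letter -> 'N'
content = strtrim( content ) ;                                                   % drop a trailing newline so the split yields no empty last token
nCols = numel( strsplit( regexp( content, '[^\r\n]+', 'match', 'once' ), ' ')) ; % column count taken from the first row
data = reshape( regexp(content, '\s+', 'split'), nCols, [] ).' ;                 % one row of tokens per line
header = data(:,1:2) ;                                                           % the two text columns
data = str2double( data(:,3:end) ) ;                                             % numeric columns; '---' becomes NaN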
Applied to the file attached, we get:
>> header
header =
5×2 cell array
{'A'} {'BCD'}
{'B'} {'BCD'}
{'N'} {'BCD'}
{'B'} {'BCD'}
{'N'} {'BCD'}
>> data
data =
99.9000 9.9000 9.9990 NaN NaN 0.9900 9.9990 9.9990 99.9900 99.9000 9.9000
99.9000 9.9000 9.9990 NaN NaN 0.9900 9.9990 9.9990 99.9900 99.9000 9.9000
99.9000 9.9000 9.9990 99.9000 0.9990 0.9900 9.9990 9.9990 99.9900 99.9000 9.9000
99.9000 9.9000 9.9990 NaN NaN 0.9900 9.9990 9.9990 99.9900 99.9000 9.9000
99.9000 9.9000 9.9990 NaN NaN 0.9900 9.9990 9.9990 99.9900 99.9000 9.9000
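Since the question states a preference for textscan, the same pre-processing idea can also feed textscan directly. What follows is only a sketch under stated assumptions: the two sample rows are made up to mirror the layout in the question, the count of 11 numeric columns is taken from the sample row, and replacing '---' with the text 'NaN' is a substitution of my own so that %f can parse it. It is not part of the answer above.

```matlab
% Two sample rows held in memory (hypothetical data in the question's layout).
% The second row has a blank where the leading 'A'/'B' would be.
content = sprintf( '%s\n%s', ...
    'A BCD 99.9 9.90 9.999 --- --- 0.99 9.999 9.999 99.99 99.9 9.9', ...
    '  BCD 99.9 9.90 9.999 99.9 0.999 0.99 9.999 9.999 99.99 99.9 9.9' ) ;

% Same pre-processing idea as above: mark a missing first letter with 'N'.
content = regexprep( content, '^\s', 'N ', 'lineanchors' ) ;

% Turn the error marker into text that %f parses as NaN.
content = strrep( content, '---', 'NaN' ) ;

% Assumption: 2 text columns followed by 11 numeric columns, as in the sample row.
nNum = 11 ;
fmt  = [ '%s %s', repmat( ' %f', 1, nNum ) ] ;

% textscan accepts a char vector in place of a file identifier.
C = textscan( content, fmt ) ;

header = [ C{1}, C{2} ] ;   % nRows-by-2 cell array of char vectors
data   = [ C{3:end} ] ;     % nRows-by-nNum double matrix, NaN where '---' was
```

With a real file, content would come from fileread as in the answer above. The fixed nNum makes this variant less general than the reshape approach, which infers the column count from the first row.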