MATLAB: Read in only certain columns of big text file

import data

I want to read in only certain columns of a big tab-separated values text document (.tsv) in the form of a table with 900 columns and 100 lines. Every line in the textfile has the format:
8 columns %s
112 repetitions of: %d%d%d%d%d%d%s
100 columns: %d (irrelevant to me)
How can I import the table as a cell array of dimension (number of columns)*(number of lines) (or vice versa) without having to specify all 900 format specifiers (and while skipping the last 100 columns)?
In fact, I only need columns 8+7*k-1 and 8+7*k for k=1:112, i.e. the last 2 elements (%d%s) of the sequence repeated 112 times.
Using textscan(fileid,'%s','Delimiter','\t'); gives me a cell array of size 1*(total number of elements) instead, which is not very practical to deal with if I want to use certain columns. Also, I didn't know how to solve the format specifier issue, so I simply read everything as strings.
Using readtable('filename.tsv','Delimiter','\t'); gives me the error message: Undefined function 'readtable' for input arguments of type 'char'.

Best Answer

I'd do it as follows, assuming that what you want to avoid is building the formatSpec by hand. We read the first line of the file to identify the non-numeric (string) columns, and we use this information to build an appropriate formatSpec for TEXTSCAN. We can then read the file and, optionally, convert the cell array of columns (a mix of cell arrays and numeric arrays) into one large cell array.
filename = 'myFile.tsv' ;
% - Get structure from first line.
fid = fopen( filename, 'r' ) ;
firstLine = fgetl( fid ) ;
fclose( fid ) ;
% A field that str2double cannot convert (NaN) is treated as a string column.
isStrCol = isnan( str2double( regexp( firstLine, '[^\t]+', 'match' ))) ;
% - Build formatSpec for TEXTSCAN.
fmt = cell( 1, numel(isStrCol) ) ;
fmt(isStrCol) = {'%s'} ;
fmt(~isStrCol) = {'%f'} ;
fmt = [fmt{:}] ;
% - Read full file.
fid = fopen( filename, 'r' ) ;
data = textscan( fid, fmt, Inf, 'Delimiter', '\t' ) ;
fclose( fid ) ;
% - Optional: aggregate columns into large cell array.
for colId = find( ~isStrCol )
data{colId} = num2cell( data{colId} ) ;
end
data = [data{:}] ;
From there, it is easy to select relevant columns.
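As a minimal sketch of that selection step, assuming the layout described in the question (8 leading columns, then 112 blocks of 7 columns, with the wanted %d%s pair at the end of each block), the column indices could be computed as follows. The names nLead, nRep, blockLen, keepCols and demo are hypothetical, and demo is only a placeholder standing in for the aggregated cell array.

```matlab
% Assumed layout, taken from the question (not verified against a real file).
nLead    = 8 ;      % leading string columns
nRep     = 112 ;    % number of repeated blocks
blockLen = 7 ;      % %d%d%d%d%d%d%s
nTail    = 100 ;    % trailing columns that are not needed
% Last column of each block (%s) and the column just before it (%d).
blockEnd = nLead + blockLen * (1:nRep) ;
keepCols = sort( [blockEnd-1, blockEnd] ) ;
% Placeholder for the aggregated cell array: 2 lines, cell (r,c) holds c.
demo = num2cell( repmat( 1:(nLead + blockLen*nRep + nTail), 2, 1 )) ;
% Keep all lines, but only the wanted columns.
selected = demo(:, keepCols) ;
```

With the real data, the last line would index the cell array produced above instead of demo.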
Note that this solution assumes that there is no white space within columns. If that is not the case, I can update the solution so that it truly treats only tabs as separators (in fact, TEXTSCAN seems to treat white space as a delimiter even when only \t is specified as the delimiter).
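If only the %d%s pair of each block is needed, an alternative sketch is to build the formatSpec with REPMAT and mark the unwanted fields with the %* skip prefix, so that TEXTSCAN does not return them at all. This is an illustration rather than part of the solution above. It writes a tiny demo file (2 leading columns, 2 blocks, 2 trailing columns) so that it runs on its own, and the names nLead, nRep, nTail, demoFile and kept are hypothetical. For the file in the question, the assumed values would be nLead = 8, nRep = 112 and nTail = 100.

```matlab
% Layout parameters for the demo file (hypothetical, scaled down).
nLead = 2 ;    % leading string columns to skip
nRep  = 2 ;    % repetitions of the 7-column block
nTail = 2 ;    % trailing integer columns to skip
% Write a small tab-separated demo file with two lines.
demoFile = [tempname, '.tsv'] ;
fid = fopen( demoFile, 'w' ) ;
fprintf( fid, 'a\tb\t1\t2\t3\t4\t5\t6\tx\t7\t8\t9\t10\t11\t12\ty\t100\t200\n' ) ;
fprintf( fid, 'c\td\t21\t22\t23\t24\t25\t26\tz\t27\t28\t29\t30\t31\t32\tw\t300\t400\n' ) ;
fclose( fid ) ;
% Build the formatSpec: a field marked with %* is parsed but not returned.
block = [repmat( '%*d', 1, 5 ), '%d%s'] ;    % skip 5 integers, keep %d and %s
fmt = [repmat( '%*s', 1, nLead ), repmat( block, 1, nRep ), repmat( '%*d', 1, nTail )] ;
% Read the file; only the kept fields appear in the output.
fid = fopen( demoFile, 'r' ) ;
kept = textscan( fid, fmt, 'Delimiter', '\t' ) ;
fclose( fid ) ;
delete( demoFile ) ;
```

Here kept is a 1-by-4 cell array: kept{1} and kept{3} hold the integer columns (as int32, because of %d), and kept{2} and kept{4} hold the string columns as cell arrays.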