MATLAB: Read specific data columns from a text file based on header name requested by user

MATLAB, text file

Hello,

I am using MATLAB R2018a.

I'm trying to extract specific columns from a text file based on the column's header name. I have tried a couple of different methods, such as readtable and textscan, but none of them worked as I expected.

I have attached the text file itself. I want to make sure the code is not slow, because there are thousands of these files that I will probably need to process in a for-loop.

The structure never changes, but the header columns can be in different positions, which is why I want the code to find a column by its header name no matter which position it is in.

Here is a sample from the text file:

As can be seen, the same dates are repeated below with different headers (information); this repeats 3-4 times in the actual text file. If I know how to pick up the "WOPR-PROD1", "WOPR-PROD2", and "FOPT" columns and put them into a matrix in this order, [WOPR-PROD1; WOPR-PROD2; FOPT], I believe I can figure out the rest. I would prefer not to modify the text file itself if possible.

"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"SUMMARY OF RUN Original_1                                                                                                        
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"DATE       ""YEARS      ""FOPR       ""FWPR       ""FGPR       ""FOPT       ""FGPT       ""FWPT       ""FWCT       ""FWIR       "
"           ""YEARS      ""STB/DAY    ""STB/DAY    ""MSCF/DAY   ""STB        ""MSCF       ""STB        ""           ""STB/DAY    "
"           ""           ""           ""           ""           ""           ""           ""           ""           ""           "
"           ""           ""           ""           ""           ""           ""           ""           ""           ""           "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
" 1JAN2009"         0            0            0            0            0            0            0            0            0     
" 1FEB2009"  0.084873            0            0            0            0            0            0            0            0     
" 1MAR2009"  0.161533     2000.000     65.16867     1360.000     56000.00     38080.00     1824.723     0.031556            0     
" 1APR2009"  0.246407     2000.000     67.93040     1360.000     118000.0     80240.00     3906.001     0.032849            0     
" 1MAY2009"  0.328542     2449.850     53.91752     1665.898     191495.5     130216.9     5523.527     0.021535            0 
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"SUMMARY OF RUN Original_1                                                                                                                                                                                                                                                                                                                                                                                                                                                              "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"DATE       ""FWIT       ""FGOR       ""FOIP       ""FWIP       ""FGIP       ""FPR        ""WOPR       ""WOPR       ""WOPR       "
"           ""STB        ""MSCF/STB   ""STB        ""STB        ""MSCF       ""PSIA       ""STB/DAY    ""STB/DAY    ""STB/DAY    "
"           ""           ""           ""*10**3     ""*10**3     ""*10**3     ""           ""           ""           ""           "
"           ""           ""           ""           ""           ""           ""           ""PROD1      ""PROD2      ""PROD3      "
"           ""           ""           ""           ""           ""           ""           ""           ""           ""           "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
" 1JAN2009"         0            0     31190.54     645456.1     21209.57     6553.930            0            0            0     
" 1FEB2009"         0            0     31190.54     645456.1     21209.57     6553.922            0            0            0     
" 1MAR2009"         0     0.680000     31134.54     645454.2     21171.49     6473.267            0            0            0     
" 1APR2009"         0     0.680000     31072.54     645452.2     21129.33     6394.598            0            0            0     
" 1MAY2009"         0     0.680000     30999.18     645450.7     21079.44     6296.722            0     1675.190            0

Any help is appreciated. Thank you.

Best Answer

Before I continue, I want to thank @Bob Nbob and @Stephen Cobeldick for their work, suggestions and help.

I really appreciate everything you are doing for this community.

For anyone who is interested in this post and was waiting for an answer:

It took me a while, but I finally got it to work correctly. The code is a little long, and I did not choose very good variable names or write sufficiently detailed comments.

If anyone has suggestions for making the code more efficient, I'll consider them.

I don't know if 0.027222 seconds (from beginning to end) is efficient enough for this kind of task: reading 68 columns of information from a text file full of "pages" of columns with no delimiters between columns.

The output is a 1x68 cell array called "storage", where each of the 68 cells holds a 99x1 or 100x1 cell array. The contents can be converted to numeric values later, after removing the trailing "." from some numbers (look at the number "8713996." for FOPT in the first cell, towards the last rows: there is no "0" after the decimal point).

Because the output is a cell array, it can also be converted to a different type of array with functions such as cell2mat (which I have not used in this code).
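As a rough sketch of that conversion step (not part of the code below), one column of "storage" could be turned into numbers with str2double, which returns NaN for text it cannot parse. The variable names and the sample column are made up for illustration; the values are copied from the sample file. As far as I know, str2double accepts a trailing decimal point such as "8713996.", so the "." may not need to be removed first.

```matlab
% Hypothetical column in the shape described above: quoted header rows
% followed by fixed-width numeric text (values copied from the sample).
column_text = {'"FOPT       "'; '"STB        "'; '"           "'; ...
               '     56000.00'; '     118000.0'; '     8713996.'};

% str2double returns NaN for anything it cannot parse as a number,
% which here means the quoted header rows
values = str2double(strtrim(column_text));

% Assumption: every NaN is a header row, not a missing measurement
values = values(~isnan(values));
```

If a file can contain genuinely missing values, dropping every NaN would also drop those rows, so the header rows would have to be removed by position instead.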

%% Read txt file
% Reset all the variables
clear;
clc;
% Read the content of the text file into memory
content = fileread('Original_1.txt');
% Declare desired string occurrences (columns) to create the storage cell
% (initially with an unknown size) and store specified columns
desired_string = ["FOPT", "FGPT", "FWPT", "WOPR", "WWPR", "WOPT", "WWPT", "WGPT", "WGPR", "WBHP", "WGOR" ];
% Create cell array to store the columns desired
count_string = count(content,desired_string );
storage = cell(1,count_string);
% Delete unnecessary strings and special characters for readability (should
% be left with 130 characters "per line"-content variable in workspace is a 
% character vector of size 1*154485). Warning: New line and carriage return
% characters also need to be deleted to get 130 characters "per line".
new_content = regexprep(content, '"SUMMARY\s.*$|"-.*$|SUMMARY\s.*$|^\s+|\n|\r', '', 'lineanchors', 'dotexceptnewline');
% If the first 13 characters contain '"' or 'DATE', insert '!' before each
% header as a delimiter and split the content into pages
search_string = {'"','DATE'};
first_13 = new_content(1:13);
if contains(first_13,search_string(1,1))
   new_str = insertBefore(new_content, '"DATE', '!');
   final_str = strsplit(new_str, '!');
elseif contains(first_13, search_string(1,2))
   new_str = insertBefore(new_content, 'DATE', '!');
   final_str = strsplit(new_str, '!');
else
   % Neither marker was found, so final_str would otherwise be undefined
   error('No DATE header found in the first 13 characters.');
end
% Search for a specific string and read all the columns
pages = length(final_str);
add = 13;      % width of one column in characters
offset = 0;    % character offset within a row (avoids shadowing count())
next_column = 0;
% For loop for reading each page of the content
for ii = 2:pages
    
    % Convert cell into character vector
    new_char_vec = char(final_str(1,ii));
    % Length of the character vector for this page
    long = length(new_char_vec);
    
    % Number of rows on this page (130 characters per row)
    row_no = long/130;
    
    % Read first 130 characters, 13 characters at a time 10 times and 
    % match the desired string
    for i = 1:10
        
        row = 1 + offset;
        column = 13 + offset;
        testing = new_char_vec(1, (row:column));
        offset = offset + add;
        
        % If this header field contains a desired string, read the whole column
        if contains(testing,desired_string)
            
            % Move to next column of storage cell when string is found
            next_column = next_column + 1;
            
            % Create cell array inside each column of storage cell array
            % row_no varies
            storage{1,next_column} = cell(row_no,1);
            % read_rows and read_columns are reset to row and column
            % every time the desired_string is found
            read_rows = row;
            read_columns = column;
            
            % Store the desired string column's first row
            storage{1,next_column}{1,1} = new_char_vec(1, (read_rows:read_columns));
            % Store each row of the desired string column (starting from 2nd row) in a for loop
            for jj = 2:(row_no)
                read_rows = read_rows + 130;
                read_columns = read_columns + 130;
                storage{1,next_column}{jj,1} = new_char_vec(1, (read_rows:read_columns));
            end
            
        end
         
    end
    
    % Reset the offset for the next page
    offset = 0;
end
% Clear all the unnecessary variables
clear add column content offset count_string desired_string final_str first_13 i ii jj long new_char_vec new_content new_str next_column pages read_rows read_columns row row_no search_string testing
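For comparison, here is a shorter sketch of a different approach: look each column up by header name and return a numeric matrix directly. It is not the code above and has not been run on the real file. The function name readSummaryColumns and its helper appendPage are made up. It assumes that data rows start with a quoted date such as " 1JAN2009" followed by whitespace-separated numbers, that the 4th header row holds the well names (as in the sample), and that every page has the same number of data rows.

```matlab
function M = readSummaryColumns(txt, wanted)
%READSUMMARYCOLUMNS  Hypothetical helper: pick columns by header name.
%   txt    - char vector with the whole file (e.g. from fileread)
%   wanted - cellstr of keys such as 'FOPT' or 'WOPR-PROD1'
%   M      - one column per requested key, in the order of wanted
textRows = regexp(txt, '\r?\n', 'split');
M = [];
keys = {};       % column keys of the current page
hdrRow = 0;      % number of header rows seen on the current page
block = [];      % numeric data of the current page
for k = 1:numel(textRows)
    ln = strtrim(textRows{k});
    if isempty(ln) || strncmp(ln, '"-', 2) || strncmp(ln, '"SUMMARY', 8)
        continue                     % separators and titles carry no data
    end
    if ~isempty(regexp(ln, '^"\s*\d+[A-Z]{3}\d{4}"', 'once'))
        % Data row: drop the quoted date and read the remaining numbers
        vals = sscanf(regexprep(ln, '^"[^"]*"', ''), '%f').';
        block(end+1, :) = vals;      %#ok<AGROW>
        continue
    end
    % Header row: split the quoted fixed-width fields
    f = strtrim(strrep(regexp(ln, '"[^"]*"', 'match'), '"', ''));
    if ~isempty(f) && strcmp(f{1}, 'DATE')
        M = appendPage(M, keys, block, wanted);   % flush the previous page
        keys = f(2:end);
        hdrRow = 1;
        block = [];
    else
        hdrRow = hdrRow + 1;
        if hdrRow == 4               % assumed: 4th header row = well names
            hasWell = ~cellfun(@isempty, f(2:end));
            if any(hasWell)
                keys(hasWell) = strcat(keys(hasWell), '-', f([false, hasWell]));
            end
        end
    end
end
M = appendPage(M, keys, block, wanted);           % flush the last page
end

function M = appendPage(M, keys, block, wanted)
% Copy the requested columns of one page into M (NaN where not found)
if isempty(block), return, end
if isempty(M), M = nan(size(block, 1), numel(wanted)); end
[found, loc] = ismember(wanted, keys);
M(:, found) = block(:, loc(found));
end
```

A call would look like M = readSummaryColumns(fileread('Original_1.txt'), {'WOPR-PROD1', 'WOPR-PROD2', 'FOPT'}). Because the columns are looked up by name on every page, the position of a header within the file does not matter. A key that is never found leaves its column as NaN.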

Related Solutions

MATLAB: Populating a cell using a loop

textscan returns a 1x55 cell array, because the format string specifies 55 fields. So your code:

A = textscan(fopen(d),'%f %f %f %f %f %f %f %f %f %f %f %f %f %f %s %s %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %s %s %f %f %f %f %f %f %f %f', 'Delimiter',',','Headerlines',1);

will result in a 1x55 cell array. Then a few lines later you write this:

b = A(n,:);

Because A only has one row, requesting anything from its (non-existent) second (or higher) row will throw an error as soon as n>1.

"I want b to be a cell with 55 columns but with n rows. one row for each data file that I run"

I really really really recommend that you don't do that: that would require putting scalar numeric data into the cells of a cell array, which just makes it much harder to work with numeric data. You should really keep the numeric data in numeric arrays, or use a table. Because you have a few columns which are character, this complicates the importing a little bit, but there are reasonable solutions which you should look at:

If you do not need those character columns then get textscan to ignore them with %*s in the format string. You could then trivially get textscan to collect all of the numeric data into one numeric array, using the CollectOutput option. Very simple, but you would lose some data.
Use the CollectOutput option to collect the data into arrays of matching types: this would give you one 1x5 cell array C, containing an Nx14 numeric array, an Nx2 char cell array, an Nx29 numeric array, an Nx2 char cell array, and an Nx8 numeric array (or whatever sizes that format string gives you). I recommend this option.
Use a table. These are a very convenient way for handling mixed data (e.g, numeric, char, categorical) and analyzing it. It has many powerful methods and operators for processing data by groups, and for statistical analyses.

Or your proposal:

If you really want to get all of your data into one cell array (which will make any numeric processing slow, inefficient and complex), then you will need to post-process the data after it has been imported, something like this: detect numeric columns, convert numeric columns to cell array containing numeric scalars (e.g. num2cell), then concatenate all into one cell array. I strongly advise you to avoid doing this.
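To illustrate the CollectOutput recommendation, here is a minimal sketch on a made-up three-line sample with four fields instead of 55. The variable names and the data are hypothetical; only the textscan options come from the answer above.

```matlab
% Hypothetical sample: two numeric fields, one text field, one numeric field
raw = sprintf('1.5,2,abc,10\n3.5,4,def,20\n5.5,6,ghi,30');

% CollectOutput merges adjacent columns of the same class into one array
C = textscan(raw, '%f %f %s %f', 'Delimiter', ',', 'CollectOutput', true);

numeric_left  = C{1};   % 3x2 double
text_middle   = C{2};   % 3x1 cell array of character vectors
numeric_right = C{3};   % 3x1 double
```

When reading from a file rather than a character vector, it is probably worth keeping the identifier returned by fopen in a variable so that the file can be closed with fclose afterwards, which textscan(fopen(d),...) does not allow.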

MATLAB: Converting unformatted text to formatted text

I have assumed that the sizes of the resulting arrays are known

fid = fopen( 'c:\m\cssm\test4.txt' );
rows = textscan( fid, '%s', 'Delimiter', '\n' );
fclose( fid );
rows = rows{:};
str = 'RainflowCycleCounterHistogram';  % avoid magic number

len = length( str );
is_counter   = strncmp( str, rows, len ); 
counter_rows = rows( is_counter );
%
str = 'RainflowCycleMeanBreakpoints';  
len = length( str );
is_mean   = strncmp( str, rows, len ); 
mean_rows = rows( is_mean );
%
str = 'RainflowCycleRangeBreakpoints';  
len = length( str );
is_range   = strncmp( str, rows, len ); 
range_rows = rows( is_range );
%
counter_matrix = nan( 10, 10 );
for jj = 1 : length( counter_rows )
%    
    cac = textscan( counter_rows{jj}, '%*s%d%d%f'   ...
                ,   'Delimiter'          , ' []:'   ...
                ,   'MultipleDelimsAsOne', true     ); 
%            
    counter_matrix( cac{1}+1, cac{2}+1 ) = cac{3};  % one based      
end
mean_vector = nan( 1, 10 );
for jj = 1 : length( mean_rows )
%    
    cac = textscan( mean_rows{jj}, '%*s%d%f'        ...
                ,   'Delimiter'          , ' []:'   ...
                ,   'MultipleDelimsAsOne', true     ); 
%            
    mean_vector( 1, cac{1}+1 ) = cac{2};  % one based      
end
range_vector = nan( 1, 10 );
for jj = 1 : length( range_rows )
%    
    cac = textscan( range_rows{jj}, '%*s%d%f'        ...
                ,   'Delimiter'          , ' []:'   ...
                ,   'MultipleDelimsAsOne', true     ); 
%            
    range_vector( 1, cac{1}+1 ) = cac{2};  % one based      
end


or maybe better - no assumptions regarding sizes

fid = fopen( 'c:\m\cssm\test4.txt' );
rows = textscan( fid, '%s', 'Delimiter', '\n' );
fclose( fid );
rows = rows{:};
str = 'RainflowCycleCounterHistogram';  % avoid magic number
len = length( str );
is_counter   = strncmp( str, rows, len ); 
counter_rows = rows( is_counter );
%
str = 'RainflowCycleMeanBreakpoints';  
len = length( str );
is_mean   = strncmp( str, rows, len ); 
mean_rows = rows( is_mean );
%
str = 'RainflowCycleRangeBreakpoints';  
len = length( str );
is_range   = strncmp( str, rows, len ); 
range_rows = rows( is_range );
%
CRS = permute( char( counter_rows ), [2,1] );
cac = textscan( CRS, '%*s%f%f%f'                ...
            ,   'Delimiter'             , '[]: '...
            ,   'MultipleDelimsAsOne'   , true  ...
            ,   'CollectOutput'         , true  ); 
num = cac{1};          
% 

sz1 = min( num(:,1:2), [], 1 );
sz2 = max( num(:,1:2), [], 1 );
sz  = sz2-sz1+[1,1];
ix_linear = sub2ind( sz, num(:,1)+1, num(:,2)+1  ); % one based 

counter_matrix( ix_linear ) = num(:,3); 
counter_matrix = reshape( counter_matrix, sz );
MRS = permute( char( mean_rows ), [2,1] );
cac = textscan( MRS, '%*s%f%f'                  ...
            ,   'Delimiter'             , '[]: '...
            ,   'MultipleDelimsAsOne'   , true  ...
            ,   'CollectOutput'         , true  ); 
num = cac{1};                  
%            
mean_vector( num(:,1)+1 ) = num(:,2);  % one based

RRS = permute( char( range_rows ), [2,1] );
cac = textscan( RRS, '%*s%f%f'                  ...
            ,   'Delimiter'             , ' []:'...
            ,   'MultipleDelimsAsOne'   , true  ...
            ,   'CollectOutput'         , true  ); 
num = cac{1};   % take the range rows; without this, num still holds the mean rows
%            
range_vector( num(:,1)+1 ) = num(:,2);  % one based

hope they return identical results :-)
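The linear-index fill used in the second version may be easier to follow on a tiny made-up example. The data below is hypothetical; it assumes zero-based row and column indices that start at 0, as in the code above. Unlike that code, the result is preallocated with nan, so any cell that is missing from the input stays NaN.

```matlab
% Hypothetical parsed rows: [row_index, col_index, value], zero-based indices
num = [0 0 5; 0 1 6; 1 0 7; 1 1 8];

sz = max(num(:,1:2), [], 1) + 1;                   % array size from the largest indices
ix_linear = sub2ind(sz, num(:,1)+1, num(:,2)+1);   % one-based linear indices

counter_matrix = nan(sz);                          % preallocate; unset cells stay NaN
counter_matrix(ix_linear) = num(:,3);
```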


and another iteration

Comments:

A function is superior to a script. It doesn't mess with the base workspace. It's easier to debug and it's easier to call from a script or function.
This function is readable. It's fairly straightforward to add new keywords and row formats.
The switch case can be replaced by a feval construct. But why do that?
The subfunctions, f1, f2 and f3, have large parts of their code in common. That asks for further refactoring.
Allocating a separate sub-function to each type of row makes testing easier.
If speed becomes a problem analyze the code with the profiler.
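For what it is worth, one alternative to the switch is a lookup of function handles by format name. This is only a sketch: the handlers below are made-up stand-ins for the real row parsers, and the keywords are hypothetical.

```matlab
% Hypothetical stand-ins for the row parsers f1, f2 and f3
handlers = struct( ...
    'f1', @(keyword, rows) numel(rows), ...
    'f3', @(keyword, rows) rows{1});

type_list = {'f1', 'CountRows'; 'f3', 'FirstRow'};
rows = {'a'; 'b'; 'c'};
S = struct();
for jj = 1:size(type_list, 1)
    fmt = type_list{jj,1};
    if ~isfield(handlers, fmt)
        error('The format, "%s", is not yet implemented', fmt);
    end
    parse = handlers.(fmt);                       % look up the handle by name
    S.(type_list{jj,2}) = parse(type_list{jj,2}, rows);
end
```

Adding a format then means adding one field to handlers instead of one case to the switch. Whether that is clearer than the switch is a matter of taste.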

>> S = cssm( 'c:\m\cssm\test4.txt' )
S = 
                   RainflowCycleCounterHistogram: [10x10 double]
                    RainflowCycleMeanBreakpoints: [-111 100 300 330 360 380 390 400 410 420]
                   RainflowCycleRangeBreakpoints: [0 35 70 100 135 170 200 230 260 300]
                  RainflowCycleReversalTolerance: 20
                        PowerCylinderTemperature: 0
               PowerCylinderTemperatureHistogram: [1x12 double]
    PowerCylinderTemperatureHistogramBreakpoints: [0 150 175 200 220 250 300 320 350 370 400]
>>

where

function    S = cssm( filespec )
    fid = fopen( filespec );
    rows = textscan( fid, '%s', 'Delimiter', '\n' );
    fclose( fid );
    rows = strtrim( rows{:} );
    type_list   = {
    ... format  keyword   
        'f1', 'RainflowCycleCounterHistogram'
        'f2', 'RainflowCycleMeanBreakpoints'
        'f2', 'RainflowCycleRangeBreakpoints'
        'f3', 'RainflowCycleReversalTolerance'
        'f3', 'PowerCylinderTemperature'
        'f2', 'PowerCylinderTemperatureHistogram'
        'f2', 'PowerCylinderTemperatureHistogramBreakpoints'    
        };
    for jj = 1 : size( type_list, 1 )
        switch type_list{jj,1}
            case 'f1'
                S.(type_list{jj,2}) = f1( type_list{jj,2}, rows );
            case 'f2'
                S.(type_list{jj,2}) = f2( type_list{jj,2}, rows );
            case 'f3'
                S.(type_list{jj,2}) = f3( type_list{jj,2}, rows );
            otherwise
                error( 'The format, "%s", is not yet implemented', type_list{jj,1} )
        end
    end
end
function    matrix = f1( keyword, rows )
    ism = is_member( keyword, rows );
    cur_rows = rows( ism );
    %
    str = permute( char( cur_rows ), [2,1] );
    cac = textscan( str, '%*s%f%f%f'                ...
                ,   'Delimiter'             , '[]: '...
                ,   'MultipleDelimsAsOne'   , true  ...
                ,   'CollectOutput'         , true  ); 
    num = cac{1};          
    % 
    sz1 = min( num(:,1:2), [], 1 );
    sz2 = max( num(:,1:2), [], 1 );
    sz  = sz2-sz1+[1,1];
    ix_linear = sub2ind( sz, num(:,1)+1, num(:,2)+1  ); % one based 
    matrix( ix_linear ) = num(:,3); 
    matrix = reshape( matrix, sz );
end
function    matrix = f2( keyword, rows )
    ism = is_member( keyword, rows );
    cur_rows = rows( ism );
    %
    str = permute( char( cur_rows ), [2,1] );
    cac = textscan( str, '%*s%f%f'                  ...
                ,   'Delimiter'             , '[]: '...
                ,   'MultipleDelimsAsOne'   , true  ...
                ,   'CollectOutput'         , true  ); 
    num = cac{1};                  
    %            
    matrix( num(:,1)+1 ) = num(:,2);  % one based      
end
function    matrix = f3( keyword, rows )
    ism = is_member( keyword, rows );
    cur_rows = rows( ism );
    %
    str = permute( char( cur_rows ), [2,1] );
    cac = textscan( str, '%*s%f', 'Delimiter',':' );
    matrix = cac{:};   
end
function    ism = is_member( keyword, rows )
    %   the keyword is followed by either ":" or "["
    cac = regexp( rows, ['^',keyword,'(?=(:|\[))'], 'once' );
    ism = not( cellfun( @isempty, cac ) );
end
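The lookahead in is_member matters because some keywords are prefixes of others. A small sketch on made-up rows shows the difference from the strncmp test used in the earlier versions. The row layout "Keyword[index]: value" is an assumption based on the delimiters used above.

```matlab
% Hypothetical rows; the "Keyword[index]: value" layout is assumed
rows = {'PowerCylinderTemperature: 0'
        'PowerCylinderTemperatureHistogram[0]: 5'
        'PowerCylinderTemperatureHistogramBreakpoints[0]: 0'};
keyword = 'PowerCylinderTemperatureHistogram';

% Prefix test: also matches the longer ...HistogramBreakpoints keyword
is_prefix = strncmp(keyword, rows, length(keyword));

% Lookahead test: the keyword must be followed directly by ":" or "["
cac = regexp(rows, ['^', keyword, '(?=(:|\[))'], 'once');
is_exact = ~cellfun(@isempty, cac);
```

Here is_prefix flags rows 2 and 3, while is_exact flags only row 2.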
