MATLAB: Extracting information from file

fileread, regexp, sscanf, textscan

I am trying to extract information from the attached file and write it into a matrix with one column each for sample name, number of cells, and porosity. I have been trying textscan and sscanf, but I am not sure how to search the structure of the text.

Best Answer

This is one way to read your file:

>> tic, sas = nohup, toc
sas = 
1x1173 struct array with fields:
    SampleName
    NumOfCells
    Porosity
Elapsed time is 27.000338 seconds.
>> ix = find( strcmp( {sas.SampleName}, 'cutTDM050_111_121_221_222_122' ) )
ix =
   583
>> sas(ix).Porosity
ans =
    0.0828    0.0828    0.0828
>> sas(ix).NumOfCells
ans =
      125000      125000      125000

where (in one m-file)

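function    sas = nohup
%% Read the whole file into one character vector
    str = fileread( 'nohup.txt' ); 
%% Split the text into one block per sample
    heading_string  = 'Running Sample';
    trailing_string = '=============================================='; 
    % Lazy match of everything between a heading and the next trailing line
    xpr = sprintf( '(?<=%s).+?(?=%s)', heading_string, trailing_string );
    cac = regexp( str, xpr, 'match' );
%% Preallocate the struct array, then parse the blocks one by one
    sas = struct( 'SampleName',repmat({''},[1,length(cac)]) ...
                , 'NumOfCells',{[]}, 'Porosity', {[]}       );
    for jj = 1 : length( cac )
        sas(jj) = nohup_( cac{jj} ); 
    end
end
function    sas = nohup_( str )
    % Sample name, e.g. cutTDM050_111_121_221_222_122 (first match only)
    sas.SampleName ... 
    =   regexp( str, 'cutTDM\d{3}_\d{3}_\d{3}_\d{3}_\d{3}_\d{3}', 'match', 'once' );
    % Every integer that follows "Num of cells =" in this block
    cac = regexp( str, '(?<=Num of cells +\= *)\d+', 'match' ); 
    sas.NumOfCells = str2double( cac );
    % Every number that follows "Porosity =" in this block
    cac = regexp( str, '(?<=Porosity +\= *)[\d+\.]+', 'match' ); 
    sas.Porosity = str2double( cac );
end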


Comments:

The function is slow. Nearly all of the time is spent in regexp, searching for "Num of cells" and "Porosity". The remark "the Num of cells and porosity value are the same." can be used to improve speed, because only the first match in each block is needed. Adding 'once' to these two calls of regexp makes the function about forty times faster. That is much more than I anticipated, and I cannot see what takes all the extra time.
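For reference, a minimal sketch of the subfunction with 'once' added to the two calls, as described above. The variable name tok is my own choice; everything else is taken from the code in the answer. With 'once', regexp returns a character vector instead of a cell array, and str2double returns NaN if nothing was found.

```matlab
function    sas = nohup_( str )
    % Sample name: first match only
    sas.SampleName ...
    =   regexp( str, 'cutTDM\d{3}_\d{3}_\d{3}_\d{3}_\d{3}_\d{3}', 'match', 'once' );
    % First "Num of cells = <integer>" in the block; later repeats are skipped
    tok = regexp( str, '(?<=Num of cells +\= *)\d+', 'match', 'once' );
    sas.NumOfCells = str2double( tok );
    % First "Porosity = <number>" in the block; later repeats are skipped
    tok = regexp( str, '(?<=Porosity +\= *)[\d+\.]+', 'match', 'once' );
    sas.Porosity = str2double( tok );
end
```

Each field now holds a scalar rather than a row of three identical values.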

>> tic, sas = nohup, toc
sas = 
1x1173 struct array with fields:
    SampleName
    NumOfCells
    Porosity
Elapsed time is 0.645206 seconds.
>> ix = find( strcmp( {sas.SampleName}, 'cutTDM050_111_121_221_222_122' ) )
ix =
   583
>> sas(ix).Porosity
ans =
    0.0828
>> sas(ix).NumOfCells
ans =
      125000
>>
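The question asked for one column each for sample name, number of cells, and porosity. A numeric matrix cannot hold the names, so one possible approach, sketched here, is a cell array for all three columns plus a plain matrix for the two numeric ones. This assumes the 'once' version, in which every field is a scalar; the one-element sas below is only a stand-in for the real 1x1173 struct array, using the values shown above.

```matlab
% Stand-in for the struct array returned by nohup (values from the run above)
sas = struct( 'SampleName', {'cutTDM050_111_121_221_222_122'} ...
            , 'NumOfCells', {125000}, 'Porosity', {0.0828}    );

% N-by-3 cell array: name, number of cells, porosity (one row per sample)
C = [ {sas.SampleName}', {sas.NumOfCells}', {sas.Porosity}' ];

% N-by-2 double matrix with the numeric columns only
M = [ [sas.NumOfCells]', [sas.Porosity]' ];
```

Row k of C and row k of M both belong to sas(k), so the index ix found with strcmp can be used on either.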

Related Solutions

MATLAB: Read and search using Textscan

1. You can't. Simply write:

c = textscan(fid,'%s');
s = c{1};

2. strncmp:

lineIndex = strncmp(s, 'IM=', 3);  % or: find(strncmp(...))
matchLine = s(lineIndex);

MATLAB: Importing cnv file with header lines

Given that

1. the entire file fits in a fraction of the memory,
2. *END* is the last line before the numerical part of the file (and this is the only occurrence of *END*),
3. all files have 29 columns of numerical data,

then one way is

str = fileread( 'd94i006.txt' );
str = regexp( str, '(?<=\*END\*\s+).+$', 'match' );
cac = textscan( str{1}, repmat('%f',[1,29]), 'CollectOutput',true );
num = cac{1};

result

>> whos num
  Name        Size            Bytes  Class     Attributes
  num       339x29            78648  double

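Note: because the header length is only approximate ("has about 130 header lines."), it is safer to locate the data by the *END* line than to skip a fixed number of header lines.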

In response to comment "loop [...] every file from the folder?"

Try

>> cac = cssm( 'c:\your_folder\with_data\' );

where in one file

function    cac = cssm( folderspec )
    sas = dir( fullfile( folderspec, '*.txt' ) );
    len = length( sas );
    cac = cell( 1, len );
    for jj = 1 : len
        cac{jj} = cssm_( fullfile( folderspec, sas(jj).name ) );
    end
end
function    num = cssm_( filespec )
    str = fileread( filespec );
    str = regexp( str, '(?<=\*END\*\s+).+$', 'match' );
    cac = textscan( str{1}, repmat('%f',[1,29]), 'CollectOutput',true );
    num = cac{1};
end

As is; not tested. Needed: better names and some comments. There is a magic number, 29, in the code.
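One way to get rid of the magic number, as a sketch only: pass the column count as an argument and fail with a clear message if no *END* line is found. The function name read_after_end, the argument ncol, and the error identifier are my own inventions; the regexp and the textscan call are the ones used above, with 'once' added so that regexp returns a character vector.

```matlab
function    num = read_after_end( filespec, ncol )
    % ncol: number of numeric columns; defaults to 29 as in the files above
    if nargin < 2
        ncol = 29;
    end
    str = fileread( filespec );
    % Everything after the *END* line; empty if the marker is missing
    str = regexp( str, '(?<=\*END\*\s+).+$', 'match', 'once' );
    if isempty( str )
        error( 'read_after_end:noEndMarker' ...
             , 'No *END* line found in %s', filespec );
    end
    % Parse the numeric block into one ncol-column double matrix
    cac = textscan( str, repmat('%f',[1,ncol]), 'CollectOutput',true );
    num = cac{1};
end
```

In cssm, the call cssm_( ... ) would then become read_after_end( ..., 29 ), or just read_after_end( ... ) to use the default.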
