MATLAB: Extracting Data field of a Series in HTML file

Tags: extract, html, MATLAB, parse, Text Analytics Toolbox

In an HTML file, there is a section like this:

        series: [{
            name: 'Numbers',
            color: '#33CCFF',
            lineWidth: 5,
            data: [45,78,84,91,111,125,178,231,274,283,303,333]        }],

How can I extract the 'data' field into an array in MATLAB code?

There are many such series in the same HTML file, each with a different 'name' field, for example 'Total Value', 'Log Scale', 'Base Value', etc.

Best Answer

I misunderstood your question. This is a bit of overkill.

Assumptions

the string, series:, always indicates the start of a block of interest

I created a sample file, cssm.txt, which I uploaded. (MATLAB Answers doesn't allow the extension .html.)

This script reads all blocks

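%% Read the file and cut out every block that starts with "series:"

chr = fileread('cssm.txt');
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
%% Parse the fields of each block
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] ); 
for jj = 1 : len
    
    % name: strip the quotes; makeValidName also removes blanks,
    % so 'Total Value' is stored as 'TotalValue'
    txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    series(jj).name = matlab.lang.makeValidName( txt );

    % color: strip the quotes and the surrounding blanks
    txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    series(jj).color = strtrim( txt );
    
    % lineWidth: a single number
    txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
    series(jj).lineWidth = str2double( txt );
    
    % data: everything up to the closing brace, e.g. ' [45,78,...]'
    txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
    series(jj).data = str2num( txt );  %#ok<ST2NM>

end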

and this extracts the data of the series whose name is 'Numbers', and not the other series

>> series(strcmp({series.name},'Numbers')).data
ans =
    45    78    84    91   111   125   178   231   274   283   303   333
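
To try the two regular expressions without creating a file, the block from the question can be pasted into a string. The following is only a minimal sketch: the variable names blk and data are my own, and str2double applied to the numeric tokens is swapped in for str2num, which evaluates its input as code. It assumes the data field holds plain decimal numbers.

```matlab
% Sample text copied from the question, so no file is needed.
chr = sprintf( '%s\n', ...
    'series: [{', ...
    '    name: ''Numbers'',', ...
    '    color: ''#33CCFF'',', ...
    '    lineWidth: 5,', ...
    '    data: [45,78,84,91,111,125,178,231,274,283,303,333]        }],' );

% Same block pattern as in the answer; 'once' because there is one block.
blk = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match', 'once' );

% Text between "data:" and the closing brace.
txt = regexp( blk, '(?<=data\:)[^}]+', 'match', 'once' );

% Assumption: plain decimal numbers only. Pick out the numeric tokens
% and convert them, instead of evaluating the text with str2num.
tok  = regexp( txt, '[-+]?\d*\.?\d+', 'match' );
data = str2double( tok );

disp( data )
```

For the sample above, data holds the same twelve values that the answer's script returns.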

In response to comment below

Assumptions

the string, series:, always indicates the start of a block of interest
the string, }], indicates the end of a block of interest
all html-files of interest are named index.html
all files named index.html are of interest
all html-files of interest are in subfolders under a root-folder, ...\finCase
every html-file, index.html, contains exactly one block that has a specific value of the field name:, e.g. Numbers

The overkill is still there. However, reading and parsing four HTML files (copies of cssm.txt) takes less than 10 ms.

Try

>> client_data = read_client_data( 'd:\m\cssm\finCase', 'index.html', 'Numbers' )
client_data =
  4×2 cell array
    {'anderson'       }    {1×9  double}
    {'kim-j-clijsters'}    {1×10 double}
    {'paul-judd'      }    {1×11 double}
    {'simmi'          }    {1×12 double}
>>

where (in one m-file)

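function    client_data = read_client_data( root, file, name )
    % Return one row per file found: { client folder name, data of series NAME }
    
    % search all subfolders of root for files named as given by file
    sad = dir( fullfile( root, '**', file ) ); 
    len = length( sad );
    client_data = cell( len, 2 );
    for jj = 1 : len 
        % the name of the folder that contains the file identifies the client
        cac = strsplit( sad(jj).folder, filesep );
        client = cac{end};
        series = read_one_file_( fullfile( sad(jj).folder, sad(jj).name ) );
        % relies on the assumption that exactly one series matches name
        client_data(jj,:) = { client, series(strcmp({series.name},name)).data };
    end
end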
function    series = read_one_file_( file )
    % Parse every "series:" block of one file into a struct array
    
    chr = fileread( file );
    cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
    
    len = length( cac );
    series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
    
    for jj = 1 : len
        
        % name: strip the quotes and the surrounding blanks
        txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).name = strtrim( txt );
        
        % color: strip the quotes and the surrounding blanks
        txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).color = strtrim( txt );
        
        % lineWidth: a single number
        txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
        series(jj).lineWidth = str2double( txt );
        
        % data: everything up to the closing brace, e.g. ' [45,78,...]'
        txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
        series(jj).data = str2num( txt );  %#ok<ST2NM>
        
    end
end

TODO: add error handling and comments
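
One piece of that error handling could look like the sketch below. In read_client_data, the expression series(strcmp({series.name},name)).data relies on the assumption that exactly one series matches; with zero or several matches the assignment into client_data fails with a less helpful message. The helper name find_series_data_ and the error identifiers are hypothetical, not part of the original answer.

```matlab
% Small demonstration with a hand-made struct array.
series = struct( 'name', {'Numbers','Log Scale'}, 'data', {[1 2 3],[4 5]} );
data   = find_series_data_( series, 'Numbers' );
disp( data )

function data = find_series_data_( series, name )
    % Hypothetical helper: return the data of the one series named NAME,
    % and raise a descriptive error if there is no match or more than one.
    is_hit = strcmp( {series.name}, name );
    switch nnz( is_hit )
        case 1
            data = series(is_hit).data;
        case 0
            error( 'read_client_data:noMatch', ...
                   'No series named "%s".', name );
        otherwise
            error( 'read_client_data:multipleMatches', ...
                   '%d series named "%s".', nnz( is_hit ), name );
    end
end
```

Inside the loop of read_client_data the helper would be used as client_data(jj,:) = { client, find_series_data_( series, name ) };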

Related Solutions

MATLAB: Fgetl, textscan, and the file position indicator

That series is valid, yes.

Did you open the file with 'rt' instead of 'r' in order to account for the CRLF?

Also note that if you use a count for textscan(), then the file position will be left after the last format code is used, before the newline for that line (unless the end of the format matches the newline).
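
A small experiment can show where textscan leaves the file position. This is a sketch only: the scratch file and its two-line contents are made up for illustration. If the position is left before the newline, as described above, the first fgetl should return only the (empty) remainder of line 1 rather than line 2.

```matlab
% Write a two-line scratch file (hypothetical test data).
fname = [tempname, '.txt'];
fid = fopen( fname, 'wt' );
fprintf( fid, '1 2\n3 4\n' );
fclose( fid );

fid   = fopen( fname, 'rt' );            % 'rt': text mode, handles CRLF
first = textscan( fid, '%f %f', 1 );     % apply the format exactly once
rest  = fgetl( fid );                    % whatever remains of line 1
next  = fgetl( fid );                    % the line read after that
fclose( fid );
delete( fname );

fprintf( 'values read by textscan: %g %g\n', first{1}, first{2} );
fprintf( 'first fgetl returned:  "%s"\n', rest );
fprintf( 'second fgetl returned: "%s"\n', next );
```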

MATLAB: How to get drive name

On Windows this function works with my local drives

>> DriveName( 'C' )
ans =
    'OSDisk'
>> DriveName( 'D' )
ans =
    'DATA'
>>

where

function    drive_name = DriveName( drive_letter ) 
    cmd_str = sprintf( 'dir %s:\\zzzzzz', drive_letter );
    [~,msg] = system( cmd_str );
    cac = strsplit( msg, '\n' );
    has = contains( cac, 'Volume in drive');
    drive_name = regexp( cac{has}, '(?<= is ).+$', 'match', 'once' );
end

I'm sure there are more robust solutions; see e.g. the Windows API function GetVolumeInformationA.

A bit better

function    drive_name = DriveName( drive_letter ) 
    cmd_str = sprintf( 'vol %s:', drive_letter );
    [~,msg] = system( cmd_str );
    cac = strsplit( msg, '\n' );
    drive_name = regexp( cac{1}, '(?<= is ).+$', 'match', 'once' );
end
