MATLAB: Help with REGEXP: extracting info from a fragment of URL inside the HTML code.


Hey guys, I have used webread/urlread to get info from this site. The output is huge, but I'm only interested in these lines:

     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=-1'> < </a></li>
     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=1'>1</a></li>
     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=2'>2</a></li>
     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=3'>3</a></li>
     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=4'>4</a></li>
     <li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=5'>5</a></li>
     <li class='disabled'><span>...</span></li>
     <li><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=22'>22</a></li>

If you notice, there's a 'segment' from the main url included in this part of the HTML code (this one: /en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=5). From this, I'd like to get the numbers at the very end of this fragment, or the numbers between the >< symbols (like 1, 2, 3, 4, 5 and 22).

I tried this foolishly thinking it was going to help but it didn't:

url='https://www.interactivebrokers.com/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=';
pattern='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=[1-9]';
[a1, a2]=regexp(url, pattern,'match');

Do you have any suggestions for this one? I previously tried '<li[^>]*><a[^>]*>(.*?)</a></li>' with the 'tokens' option, and although it captures these values, it also captures a lot of stuff I don't want.

Thanks for your help!

Best Answer

"Keep in mind that regular expressions are not a robust or neat way to parse HTML." Anyhow, this can serve as an exercise in regular expressions.

>> cssm('h:\m\cssm\cssm.txt')
ans =
     1     2     3     4     5    22

where

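function    num = cssm( ffs )
    % Read the saved HTML into one character vector.
    str = fileread( ffs );
    xpr = '/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=';
    % Escape regex metacharacters such as '?' and '.' in the literal fragment.
    xpr = regexptranslate( 'escape', xpr );
    % Match the digits of the link text: preceded by the fragment, the page
    % number and '>, and followed by <.
    xpr = ['(?<=',xpr,'\d+''>)\d+(?=<)'];
    cac = regexp( str, xpr, 'match' );
    num = str2double( cac );
end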

and where h:\m\cssm\cssm.txt contains the html-code of the question.

The look-behind contains '\d+', so the length of the text it has to match varies, which may hamper performance.
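A variation that sidesteps the variable-length look-behind is to capture the digits with the 'tokens' option instead. What follows is only a sketch: `frag`, `str`, `tok` and `num` are names chosen here, and `str` is a shortened copy of the fragment from the question. It also illustrates two things the attempt in the question missed: the pattern has to be applied to the HTML text rather than to the URL, and '?' and '.' must be escaped to be taken literally.

```matlab
% URL fragment from the question, treated as literal text.
frag = '/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=';

% Shortened copy of the HTML from the question (single quotes doubled for MATLAB).
str  = [ '<li class=''''><a href=''', frag, '-1''> < </a></li>', ...
         '<li class=''''><a href=''', frag, '1''>1</a></li>', ...
         '<li><a href=''', frag, '22''>22</a></li>' ];

% Escape the literal fragment, then capture the digits that follow it.
xpr = [ regexptranslate( 'escape', frag ), '(\d+)''>' ];
tok = regexp( str, xpr, 'tokens' );   % cell array of 1x1 cells
num = str2double( [tok{:}] );         % numeric row vector
```

The link with page=-1 is skipped because '\d+' has to start right after 'page=', and the minus sign is not a digit.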

Related Solutions

MATLAB: I want to extract the page buttons/widgets in a website using URLREAD.

When you start clicking on pages, the page ID is in the URL, e.g.

 https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=17

you can see it as the last URL parameter. It is therefore easy to build the URL for a given page with SPRINTF, e.g. in a loop:

 urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
 for pageId = 1 : 83
    url  = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    % Do something.
 end

Then maybe you want to parse the HTML to get the table data, and you can use regular expressions for this. Training with page 1:

 pageId = 1 ;
 url    = sprintf( '%s%d', urlBase, pageId ) ;
 html   = urlread( url ) ;
 pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;
 data = regexp( html, pattern, 'names' ) ;

With that you get:

 >> data
 data = 
  1×100 struct array with fields:
    ibSymbol
    externalUrl
    name
    symbol
    currency
 >> data(1)
 ans = 
  struct with fields:
       ibSymbol: 'AT'
    externalUrl: 'https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=G…'
           name: 'ATLANTIC POWER CORP'
         symbol: 'AT'
       currency: 'USD'

which is a struct array with the 100 entries of the table, including the URL of the page that you get in the popup window when you click on a product. So then you can work on parsing these pages:

 html_ext = urlread( data(1).externalUrl ) ;
 pattern_ext = '...' ;
 data_ext = regexp( html_ext, pattern_ext, ... ) ;

I'll let you develop that part, though! And putting everything together, you get a crawler/parser for the whole thing:

 urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
 pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;
 pattern_ext = '...' ;
 for pageId = 1 : 83
    url  = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    data = regexp( html, pattern, 'names' ) ;
    for productId = 1 : numel( data )
       html_ext = urlread( data(productId).externalUrl ) ;
       data_ext = regexp( html_ext, pattern_ext, ... ) ;
       % Do something.
    end
 end

That gives you a series of concepts/tools/examples that could be useful for what may come next in your developments.
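The upper bound 83 in the loops above is hard-coded. As a sketch, it could instead be read from the pagination links of the first page, assuming (as in the IBIS fragment quoted in the first question) that the last page number appears among them. The `html` string below is a hypothetical stand-in for what urlread would return, and `lastPage` and `urls` are names chosen here.

```matlab
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;

% Hypothetical stand-in for the HTML of page 1 (not real site output).
html = [ '<li><a href=''/en/index.php?f=2222&page=1''>1</a></li>', ...
         '<li><a href=''/en/index.php?f=2222&page=2''>2</a></li>', ...
         '<li><a href=''/en/index.php?f=2222&page=83''>83</a></li>' ] ;

% Capture every page number that closes an href, then keep the largest.
tok      = regexp( html, 'page=(\d+)''', 'tokens' ) ;
lastPage = max( str2double( [tok{:}] ) ) ;

% Build the full list of page URLs up front.
urls = arrayfun( @(p) sprintf( '%s%d', urlBase, p ), 1 : lastPage, ...
                 'UniformOutput', false ) ;
```

The loop can then run over `urls` instead of a fixed range.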

PS: if you need to learn regular expressions in MATLAB, download the "MATLAB Programming Fundamentals" PDF document from

https://www.mathworks.com/help/pdf_doc/matlab/index.html

and go through the doc and examples on pages 2-42 to 2-73. It is a pretty good introduction/overview.
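To experiment with the named-token pattern from this answer without hitting the site, it can be run against a single table row. The row below is hypothetical (its markup and the example.com URL are invented to fit the pattern), so treat this as a sketch of how the 'names' option behaves, not as a reproduction of the real page.

```matlab
% Pattern copied from the answer above.
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;

% Hypothetical table row; the markup of the real page may differ.
row = ['<tr><td>AT</td><td><a href="javascript:NewWindow(''https://example.com/details?id=1'',', ...
    '''Details'')">ATLANTIC POWER CORP</a></td><td>AT</td><td>USD</td></tr>'] ;

% One match gives a 1x1 struct whose fields are the named tokens.
data = regexp( row, pattern, 'names' ) ;
fprintf( '%s | %s | %s\n', data.ibSymbol, data.name, data.currency ) ;
```

Each `(?<name>...)` group becomes a field of the struct, which is why the output in the answer lists ibSymbol, externalUrl, name, symbol and currency.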

MATLAB: Is there a way to pull a specific link after using webread() to get the content from a page

I think I've solved it by putting '\S+' in the expression and '?=&sa'. That way the expression will match all the characters following 'https?://en' but stop at the right point.

 regexp(content,'https?://en.\S+(?=&(amp);sa)','match')

This will find everything up until the '&(amp);sa'! If there's a more efficient way of doing this let me know!
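One possible tightening, offered as a sketch and not as a definitive answer: escape the dot after 'en' so it only matches a literal period, and make the quantifier lazy ('\S+?') so the match stops at the first '&amp;sa' instead of the last one in the same run of non-whitespace characters. The `content` string is a hypothetical example, and it assumes the page encodes '&' as '&amp;'.

```matlab
% Hypothetical content; real webread output will differ.
content = '<a href="/url?q=https://en.wikipedia.org/wiki/MATLAB&amp;sa=U&amp;sa2=x">MATLAB</a>' ;

% Lazy \S+? stops at the first position where the look-ahead succeeds.
link = regexp( content, 'https?://en\.\S+?(?=&amp;sa)', 'match', 'once' ) ;
disp( link )
```

With a greedy '\S+' the same input would yield a match that runs on to the second '&amp;sa'.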
