I want to input my website's URL and find the URLs of the links included in its home page. Is this possible?
MATLAB: Find links of website
links, url
Related Solutions
When you start clicking on pages, the page ID is in the URL, e.g.
https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=17
where it appears as the last URL parameter (page=17). It is therefore easy to build the URL for a given page with sprintf, e.g. in a loop:
    urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
    for pageId = 1 : 83
        url = sprintf( '%s%d', urlBase, pageId ) ;
        html = urlread( url ) ;
        % Do something.
    end
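The URL-building step can be checked on its own, without downloading anything. The following is a minimal sketch; the base URL and the page range are hypothetical placeholders chosen for illustration, not values from the site above.

```matlab
% Hypothetical base URL (an assumption for illustration only).
urlBase = 'https://example.com/list?limit=100&page=' ;
pageIds = 1 : 3 ;

% Build one URL per page ID by appending the ID to the base.
urls = cell( size( pageIds ) ) ;
for k = 1 : numel( pageIds )
    urls{k} = sprintf( '%s%d', urlBase, pageIds(k) ) ;
end

disp( urls{3} )
```

Because urlBase is passed as an argument to sprintf rather than used as the format string, any '%' characters in the URL are not interpreted as format specifiers.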
Next, you may want to parse the HTML to get the table data; regular expressions work for this. Developing the pattern on page 1:
    pageId = 1 ;
    url = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
               '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
               '</td>\s*<td>(?<currency>[^<]+)'] ;
    data = regexp( html, pattern, 'names' ) ;
With that you get:
    >> data
    data =
      1×100 struct array with fields:
        ibSymbol
        externalUrl
        name
        symbol
        currency
    >> data(1)
    ans =
      struct with fields:
           ibSymbol: 'AT'
        externalUrl: 'https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=G…'
               name: 'ATLANTIC POWER CORP'
             symbol: 'AT'
           currency: 'USD'
which is a struct array with the 100 entries of the table, including the URL of the page that you get in the popup window when you click on a product. So then you can work on parsing these pages:
    html_ext = urlread( data(1).externalUrl ) ;
    pattern_ext = '...' ;
    data_ext = regexp( html_ext, pattern_ext, ... ) ;
I'll let you develop that part, though. Putting everything together, you get a crawler/parser for the whole thing:
    urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
    pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
               '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
               '</td>\s*<td>(?<currency>[^<]+)'] ;
    pattern_ext = '...' ;
    for pageId = 1 : 83
        url = sprintf( '%s%d', urlBase, pageId ) ;
        html = urlread( url ) ;
        data = regexp( html, pattern, 'names' ) ;
        for productId = 1 : numel( data )
            html_ext = urlread( data(productId).externalUrl ) ;
            data_ext = regexp( html_ext, pattern_ext, ... ) ;
            % Do something.
        end
    end
That gives you a series of concepts/tools/examples that could be useful for what may come next in your developments.
PS: if you need to learn regular expressions in MATLAB, download the "MATLAB Programming Fundamentals" PDF document from
and go through the doc and examples on pages 2-42 to 2-73. It is a pretty good introduction/overview.
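As a small starting point for the named-token syntax used above, here is a sketch that pulls link URLs and link text out of an HTML string. The HTML snippet, the pattern, and the field names url and text are assumptions made up for this illustration. Real pages vary (single-quoted attributes, extra attributes before href, nested tags), so a pattern like this usually needs adjusting per site.

```matlab
% Hypothetical HTML fragment (in practice this would come from urlread).
html = ['<p><a href="https://example.com/a">Alpha</a> and ', ...
        '<a href="https://example.com/b" class="x">Beta</a></p>'] ;

% Named tokens: (?<url>...) captures the href value and
% (?<text>...) captures the visible link text.
pattern = '<a\s+href="(?<url>[^"]+)"[^>]*>(?<text>[^<]*)</a>' ;

% With the 'names' option, regexp returns a struct array with one
% element per match and one field per named token.
links = regexp( html, pattern, 'names' ) ;

for k = 1 : numel( links )
    fprintf( '%s -> %s\n', links(k).text, links(k).url ) ;
end
```

The same approach applies to a home page: read its HTML, then collect the url field of every match.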
I figured out what the problem was. The value variable is an array that increases in size with each iteration. Thus, what I needed to do was specify value(end), like so:
    db_url = 'http://someurl/update.php?value=' ;
    db_url = strcat( db_url, num2str( value(end) ) ) ;
    urlread( db_url ) ;
    clear db_url
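The difference can be seen without sending any request. This is a minimal sketch: the loop and the values stored in value are made up for illustration, and only the URL string is built.

```matlab
% Hypothetical data: value grows by one element per iteration.
value = [] ;
for k = 1 : 3
    value(end+1) = 10 * k ;
end

base = 'http://someurl/update.php?value=' ;

% Converting the whole array puts every element into the URL,
% separated by spaces.
bad_url = strcat( base, num2str( value ) ) ;

% Converting only the last element gives the intended single value.
db_url = strcat( base, num2str( value(end) ) ) ;

disp( db_url )
```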