MATLAB: Regular expressions help with HTML source code

regexp, regexpi, urlread

I'm looking to parse some HTML source code to pull information from the Wall Street Journal. I need the prices of the following commodities: the 4 domestic crude oil spot prices, copper, aluminum, cotton, and cocoa.

This is the URL: http://online.wsj.com/mdc/public/page/2_3023-cashprices.html

I'm having some trouble with getting regexp to work the way I want it to.

What regular expression would you use to pull out the middle (bold) price listed? If the value is n.a., it's okay if it just returns 'n.a.' or its equivalent.

I tried a variety of methods and I couldn't get it to work.

Could someone show an example of the string he or she would use for extracting the price?

Thanks!

Best Answer

Did you see my answer to your previous question? Tokens work well in such situations:

 >> buffer = urlread('http://online.wsj.com/mdc/public/page/2_3023-cashprices.html');
 >> item    = 'West Texas Intermediate, Cushing' ;
 >> % Skip lazily to the first bold (<b>...</b>) cell after the item name, then
 >> % capture an optional prefix (e.g. a currency code) and the price as named tokens.
 >> pattern = [item, '.*?<b>(?<prefix>.*?)(?<price>[\d\.]*)</b>'] ;
 >> tokens  = regexp(buffer, pattern, 'names')
 tokens = 
    prefix: ''
     price: '92.06'
 >> item    = 'London fixing, spot price' ;
 >> pattern = [item, '.*?<b>(?<prefix>.*?)(?<price>[\d\.]*)</b>'] ;
 >> tokens  = regexp(buffer, pattern, 'names')
 tokens = 
    prefix: '&#163;'          % HTML character code for the pound sign.
     price: '19.4273'

Cheers,

Cedric

Note that '.' is returned as the price for n.a. entries.

EDIT 1: corrected pattern thanks to Walter's comment about pound-signs.

EDIT 2: updated with named tokens so we get the prefix (e.g. pound-sign).
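
To pull several commodities in one go and get numeric values, the named-token approach could be wrapped in a loop along these lines. This is only a sketch: the HTML snippet and the 'Cocoa, Ivory Coast' label are made up so the code runs offline, and it assumes each price of interest sits in a `<b>...</b>` cell after the item name. Entries without any digit (such as n.a.) are left as NaN.

```matlab
% Hypothetical HTML standing in for the page source returned by urlread.
buffer = ['<tr><td>West Texas Intermediate, Cushing</td>', ...
          '<td>91.50</td><td><b>92.06</b></td><td>93.10</td></tr>', ...
          '<tr><td>Cocoa, Ivory Coast</td>', ...
          '<td>n.a.</td><td><b>n.a.</b></td><td>n.a.</td></tr>'];

items  = {'West Texas Intermediate, Cushing', 'Cocoa, Ivory Coast'};
prices = nan(size(items));               % NaN marks missing (n.a.) entries

for k = 1:numel(items)
    % regexptranslate escapes any regexp metacharacters in the item name.
    pattern = [regexptranslate('escape', items{k}), ...
               '.*?<b>(?<prefix>.*?)(?<price>[\d\.]*)</b>'];
    tok = regexp(buffer, pattern, 'names', 'once');
    % Convert only when the price token contains at least one digit.
    if ~isempty(tok) && any(isstrprop(tok.price, 'digit'))
        prices(k) = str2double(tok.price);
    end
end
```

With the real page, `buffer` would come from the `urlread` call shown in the answer, and `items` would list the labels exactly as they appear in the page source.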

Related Solutions

MATLAB: Regular expression to match “=”

regexp('a == b = d','(?<!=)=(?!=)')

This matches each = that is neither preceded nor followed by another =, using a negative lookbehind (?<!=) and a negative lookahead (?!=).
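
As a small illustration of what the call returns, the sketch below collects the match position and also uses the same lookarounds to split the string at the lone =. The variable names and the extra `\s*` around the pattern are additions for this example only.

```matlab
str = 'a == b = d';

% Start index of every '=' that is not part of '=='.
idx = regexp(str, '(?<!=)=(?!=)');

% Split at the lone '=' and swallow the surrounding whitespace.
parts = regexp(str, '\s*(?<!=)=(?!=)\s*', 'split');

disp(idx)      % position of the lone '=' in str
disp(parts)    % left-hand and right-hand side as a cell array
```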

MATLAB: How to automatically save and index data from an internet database at a specified interval

Most languages will allow you to extract data from the internet. Relevant questions might be:

Where can you get the data? Is it free, historical, real time, reliable, etc.?
Is there an API available, or do you need to parse web pages yourself?
How much time can you afford to spend writing your own parser?
Is it meaningful to build a data logger when historical data are available?
If you use MATLAB, can you afford to have it dedicated to data logging?

I would personally use Python, at least for the data logging part, as this language usually minimizes the time to solution (and you might prefer to invest your time in data analysis rather than in building a web crawler). There are plenty of Python libraries that will help you do almost everything (I have seen many threads about this, even though I have not worked on it myself). More than that, I could not afford to have MATLAB tied up with data extraction/logging for a significant part of the day, every day.

Now, if you just want to play a little in MATLAB to see what you can do, it is not too difficult to write some simple code for extracting/logging data. Try the following, for example:

Open http://www.google.com/finance?q=AAPL to get the Apple quote. 'AAPL' is the stock symbol, and you can see that it appears in the URL. Open the source of the web page (CTRL+U in Firefox) and look for the price (431.72 as I write). You'll find it in a place that looks like this (with different numbers):

 values:["AAPL","Apple Inc.","431.72","+1.14","chg","0.26"

which is probably a good chunk of string for pattern matching (because it is close to the stock symbol).

Now in MATLAB, do the following:

 >> stockSymbol = 'AAPL' ;
 >> buffer = urlread(['http://www.google.com/finance?q=', stockSymbol]) ;

If you look at the content of buffer, you'll recognize the source code of the web page. So at this point you want to extract the quote based on pattern matching. You can achieve this with a regexp:

 >> pattern = ['values:\["', stockSymbol, '",".*?","(?<price>[\d\.]*?)","(?<change>[+\-\d\.]*?)".*?"(?<percent>[+\-\d\.]*?)"'] ;
 >> quote = regexp(buffer, pattern, 'names')
 quote = 
      price: '431.72'
     change: '+1.14'
    percent: '0.26'

And voilà! You can then convert the values to double, store them in a file, or do anything else with them. I could describe the pattern in more detail, but for now let's say that it matches the static literal values:[", then the stock symbol as another literal, and then the three numbers framed by double quotes, commas, etc. Each part meant to match a number (including special characters like +, -, and . when relevant) is saved as a named token. These token names define the fields of the struct returned by regexp.
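
The conversion and storage step mentioned above could look roughly like this. It is a sketch under assumptions: the struct is hard-coded with the values from the example output so the code runs without a network connection, and the log file name and CSV layout are hypothetical.

```matlab
% Values copied from the example output; normally this struct comes from regexp.
quote = struct('price', '431.72', 'change', '+1.14', 'percent', '0.26');

% Convert every field from text to double (3x1 column: price, change, percent).
values = structfun(@str2double, quote);

% Append one timestamped line to a CSV log (hypothetical file name).
logFile = fullfile(tempdir, 'quotes_log.csv');
fid = fopen(logFile, 'a');
fprintf(fid, '%s,%.2f,%.2f,%.2f\n', datestr(now, 'yyyy-mm-dd HH:MM:SS'), values);
fclose(fid);
```

Opening the file in append mode ('a') keeps earlier lines, so each call adds one row to the log.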

Wrapping the whole thing into a small function, you get:

 function quote = getQuote_google(stockSymbol)
    % Return price, change, and percent change for stockSymbol as text fields.
    buffer  = urlread(['http://www.google.com/finance?q=', stockSymbol]) ;
    % \[ matches the literal bracket in values:["AAPL",...
    pattern = ['values:\["', stockSymbol, '",".*?","(?<price>[\d\.]*?)","(?<change>[+\-\d\.]*?)".*?"(?<percent>[+\-\d\.]*?)"'] ;
    quote   = regexp(buffer, pattern, 'names') ;
 end

that you can then easily use as follows:

 >> quote = getQuote_google('AAPL')
 quote = 
      price: '431.72'
     change: '+1.14'
    percent: '0.26'
 >> quote = getQuote_google('GOOG')
 quote = 
      price: '831.52'
     change: '-1.08'
    percent: '-0.13'
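
For the "at a specified interval" part, one option is a MATLAB timer object that calls the quote function periodically. The sketch below is an assumption about how that could be wired up, not part of the original answer: `fetchQuote` is a stub standing in for getQuote_google so the code runs offline, and the log lines are collected in the timer's UserData rather than written to disk.

```matlab
% Stub standing in for getQuote_google so the sketch runs offline (assumption).
fetchQuote = @(symbol) struct('price', '431.72');

symbol  = 'AAPL';
% One log line per call: timestamp, symbol, price.
logLine = @() sprintf('%s,%s,%s', datestr(now, 31), symbol, ...
                      getfield(fetchQuote(symbol), 'price'));

% Fire three times, one second apart, appending each line to UserData.
t = timer('ExecutionMode', 'fixedRate', 'Period', 1, ...
          'TasksToExecute', 3, 'UserData', {}, ...
          'TimerFcn', @(obj, ~) set(obj, 'UserData', [obj.UserData; {logLine()}]));
start(t);
wait(t);                 % block until the three ticks are done
logLines = t.UserData;   % 3x1 cell array of log lines
delete(t);
```

A long-running logger would use a longer Period, drop TasksToExecute, and write each line to a file inside the callback instead of keeping everything in memory.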
