MATLAB: How to automatically save and index data from an internet database at a specified interval

commodities, data acquisition, interval, real time, scrape, stocks, urlread

I want to be able to read source code from a URL and extract variable values (such as commodity prices or stocks) automatically at a specified interval (hourly or daily). The extracted values would then, ideally, be appended to a matrix of values so I can look at fluctuations over time. Is this possible in MATLAB?
If so, is it elegant?
How would you suggest approaching the problem?
If not, is there another language I should consider? For what benefits?
Thanks.

Best Answer

Most languages will allow you to extract data from the internet. Relevant questions might be:
  • Where to get the data? Is it free, historical, real time, reliable, etc.?
  • Is there an API available, or do you need to parse web pages yourself?
  • How much time can you afford to spend writing your own parser?
  • Is it worthwhile to build a data logger when historical data are available?
  • If you use MATLAB, can you afford to have it dedicated to data logging?
I would personally use Python, at least for the data logging part, as this language usually minimizes the time to solution (and you might prefer to invest your time in data analysis rather than in building a web crawler). There are plenty of Python libraries that will help you do almost everything (I have seen many threads about this, even though I have not worked on it myself). More than that, I could not afford to have MATLAB tied up with data extraction and logging for a significant part of the day, every day.
Now, if you just want to play a little in MATLAB to see what you can do, it is not too difficult to write simple code for extracting and logging data. Try the following, for example:
Open http://www.google.com/finance?q=AAPL to get the Apple quote. 'AAPL' is the stock symbol, and you can see that it appears in the URL. Open the source of the web page (Ctrl+U in Firefox) and look up the price (431.72 as I write). You'll find it in a place that looks like this (with different numbers):
values:["AAPL","Apple Inc.","431.72","+1.14","chg","0.26"
which is probably a good chunk of string for pattern matching (because it is close to the stock symbol).
Now in MATLAB, do the following:
>> stockSymbol = 'AAPL' ;
>> buffer = urlread(['http://www.google.com/finance?q=', stockSymbol]) ;
If you look at the content of buffer, you'll recognize the source code of the web page. So at this point you want to extract the quote based on pattern matching. You can achieve this with a regexp:
>> pattern = ['values:\["', stockSymbol, '",".*?","(?<price>[\d\.]*?)","(?<change>[+-\d\.]*?)".*?"(?<percent>[+-\d\.]*?)"'] ;
>> quote = regexp(buffer, pattern, 'names')
quote =
      price: '431.72'
     change: '+1.14'
    percent: '0.26'
And voilà! You can then convert the values to double, store them in a file, or do anything else with them. I could describe the pattern in more detail, but for now let's say that it matches a static literal like "values:[", then another literal that is the stock symbol, and then the three numbers framed by double quotes, commas, etc. Each part meant to match a number (including special characters like +, -, and . where relevant) is saved as a named token. These token names define the fields of the struct that regexp returns.
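To make the named tokens and the conversion step concrete, here is a minimal sketch that runs the same kind of pattern against the sample snippet quoted earlier instead of a live download. The variable names price, change, and percent are only illustrative. The sketch also makes two small adjustments that are assumptions on my part rather than requirements: the literal [ is escaped, and the hyphen is placed first in each character class so it cannot be read as a range.

```matlab
% Sample text copied from the page-source snippet quoted above (not a live download).
buffer = 'values:["AAPL","Apple Inc.","431.72","+1.14","chg","0.26"';
stockSymbol = 'AAPL';

% Literal prefix, the stock symbol, a skipped company name, then three
% named tokens that capture the numbers between double quotes.
pattern = ['values:\["', stockSymbol, '",".*?","(?<price>[\d.]*?)",', ...
           '"(?<change>[-+\d.]*?)".*?"(?<percent>[-+\d.]*?)"'];
quote = regexp(buffer, pattern, 'names');

% The token values are character arrays; str2double converts them to numbers.
price   = str2double(quote.price);
change  = str2double(quote.change);
percent = str2double(quote.percent);
```

Once the values are numeric, they can be appended to a matrix or written to a file.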
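Wrapping the whole thing into a cute function, you get:
function quote = getQuote_google(stockSymbol)
    % Download the page source for the given stock symbol.
    buffer = urlread(['http://www.google.com/finance?q=', stockSymbol]) ;
    % Match the symbol, skip the company name, and capture three named tokens
    % (\[ matches the literal opening bracket).
    pattern = ['values:\["', stockSymbol, '",".*?","(?<price>[\d\.]*?)","(?<change>[+-\d\.]*?)".*?"(?<percent>[+-\d\.]*?)"'] ;
    % quote is a struct with the fields price, change, and percent (as strings).
    quote = regexp(buffer, pattern, 'names') ;
end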
that you can then easily use as follows:
>> quote = getQuote_google('AAPL')
quote =
      price: '431.72'
     change: '+1.14'
    percent: '0.26'
>> quote = getQuote_google('GOOG')
quote =
      price: '831.52'
     change: '-1.08'
    percent: '-0.13'
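For the "specified interval" part of the question, one possible approach is a MATLAB timer object that calls the quote function periodically and appends each sample as a row of a matrix. The following is only a sketch under some assumptions. fetchQuote, quoteToRow, and appendSample are hypothetical helper names. fetchQuote returns a fixed placeholder quote so the example runs offline, and the period is set to one second with three samples for demonstration. There is no handling for a failed download or an empty regexp match.

```matlab
% Placeholder quote source so the sketch runs offline (hypothetical);
% replace with @() getQuote_google('AAPL') to poll the live page.
fetchQuote = @() struct('price', '431.72', 'change', '+1.14', 'percent', '0.26');

% Convert one quote struct (string fields) into a numeric row:
% [timestamp (datenum), price, change, percent].
quoteToRow = @(q) [now, str2double(q.price), str2double(q.change), str2double(q.percent)];

% Callback: append one new row to the matrix kept in the timer's UserData.
appendSample = @(tmr, ~) set(tmr, 'UserData', ...
    [get(tmr, 'UserData'); quoteToRow(fetchQuote())]);

% Demo settings: one sample per second, three samples in total.
% For hourly logging use 'Period', 3600 and drop 'TasksToExecute'.
t = timer('ExecutionMode', 'fixedRate', ...
          'Period', 1, ...
          'TasksToExecute', 3, ...
          'UserData', zeros(0, 4), ...
          'TimerFcn', appendSample);

start(t);                       % returns immediately; the timer fires on its own
wait(t);                        % demo only: block until the three samples are taken
quoteLog = get(t, 'UserData');  % 3-by-4 matrix, one row per sample
delete(t);                      % release the timer object
```

For long-running logging you would skip wait(t), read the UserData property whenever you want the current matrix, and perhaps persist it with save('quoteLog.mat', 'quoteLog'). MATLAB still has to stay open for the timer to fire, which is the cost mentioned above.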