MATLAB: Regexp with multiple lines

regexp url read

Hello, I would like to use urlread and regexp to extract a specific number from a url. Part of the content of the url is:

<td class="text-right">1</td>
<td data-sort="Ethereum"><img src="https://s2.coinmarketcap.com/static/img/coins/16x16/1027.png" class="logo" alt="Ethereum"><a href="/currencies/ethereum/" class="market-name">Ethereum</a></td>
<td data-sort="ETH/USD"><a href="https://www.kraken.com" target="_blank">ETH/USD</a></td>
<td class="text-right" data-sort="41450600.0">
<span class="volume" data-usd="41450600.0" data-btc="4441.21" data-native="54539.6">
$41,450,600
</span>
</td>
<td class="text-right" data-sort="760.01">
<span class="price" data-usd="760.01" data-btc="0.0814309" data-native="760.01">
$760.01
</span>
</td>
<td class="text-right" data-sort="20.679496448"><span data-format-percentage data-format-value="20.679496448">20.68</span>%</td>
<td class="text-right ">Recently</td>
</tr>
<tr>
<td class="text-right">2</td>
<td data-sort="Bitcoin"><img src="https://s2.coinmarketcap.com/static/img/coins/16x16/1.png" class="logo" alt="Bitcoin"><a href="/currencies/bitcoin/" class="market-name">Bitcoin</a></td>
<td data-sort="BTC/EUR"><a href="https://www.kraken.com" target="_blank">BTC/EUR</a></td>
<td class="text-right" data-sort="36350300.0">
<span class="volume" data-usd="36350300.0" data-btc="3894.74" data-native="3905.16">
$36,350,300
</span>
</td>
<td class="text-right" data-sort="9308.28">
<span class="price" data-usd="9308.28" data-btc="0.997331" data-native="7839.6">
$9308.28
</span>
I would like to extract this content, which begins with the word Ethereum and ends with the number 760.01:
Ethereum</a></td>
<td data-sort="ETH/USD"><a href="https://www.kraken.com" target="_blank">ETH/USD</a></td>
<td class="text-right" data-sort="41450600.0">
<span class="volume" data-usd="41450600.0" data-btc="4441.21" data-native="54539.6">
$41,450,600
</span>
</td>
<td class="text-right" data-sort="760.01">
<span class="price" data-usd="760.01" data-btc="0.0814309" data-native="760.01">
$760.01

I'm trying to use this code, but I don't know what expression to use:

urlKraken='https://coinmarketcap.com/exchanges/kraken/';
strC=urlread(urlKraken);
expression='';
[startIndex,endIndex] = regexp(strC,expression);

Best Answer

The value is highly variable, so you need to extract it first, before you can compose the appropriate expression, which for the current price would be: Ethereum.*?\$760\.01

Note that in MATLAB's regexp the dot matches newline characters by default, so .*? can span multiple lines. The lazy form .*? stops at the first occurrence of the price instead of the last one.

%load the data
urlKraken='https://coinmarketcap.com/exchanges/kraken/';
strC=urlread(urlKraken);%#ok<URLRD> apparently you're on an old release
%find a few relevant markers
ind1=strfind(strC,'ETH/USD');
ind2=strfind(strC,'$');
ind3=strfind(strC,char(10));%#ok<CHARTEN> old releases don't have the newline function
%The value you're looking for runs from the second dollar sign after the
%first mention of 'ETH/USD' up to the next newline character.
ind1=ind1(1);
ind2(ind2<ind1)=[];ind2=ind2(2);
ind3(ind3<ind2)=[];
val=strtrim(strC((ind2+1):(ind3(1)-1)));%strtrim drops a stray carriage return, if any
%escape the value, because the dot is a regexp metacharacter
valEsc=regexptranslate('escape',val);
expression=['Ethereum.*?\$' valEsc];
[startIndex,endIndex] = regexp(strC,expression,'once');%'once' returns the first match only
%if you only want the snippet you show in your question, use this:
expression=['Ethereum</a></td>.*?\$' valEsc];
[startIndex,endIndex] = regexp(strC,expression,'once');
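
As an alternative sketch, the price can also be captured directly with a token, without locating the value by index first. This assumes the page keeps the layout shown in the question (a span with class "price" somewhere after the first ETH/USD cell). The string below is a shortened stand-in for the output of urlread, and the variable names are only illustrative.

```matlab
% Shortened stand-in for the page source returned by urlread (assumption:
% the real page keeps the same layout as the snippet in the question).
nl = char(10); %#ok<CHARTEN>
strC = ['<td data-sort="ETH/USD"><a href="https://www.kraken.com" target="_blank">ETH/USD</a></td>' nl ...
        '<td class="text-right" data-sort="41450600.0">' nl ...
        '<span class="volume" data-usd="41450600.0" data-btc="4441.21" data-native="54539.6">' nl ...
        '$41,450,600' nl '</span>' nl '</td>' nl ...
        '<td class="text-right" data-sort="760.01">' nl ...
        '<span class="price" data-usd="760.01" data-btc="0.0814309" data-native="760.01">' nl ...
        '$760.01' nl '</span>'];

% The lazy .*? spans the intervening lines (the dot matches newlines by
% default) and stops at the first price span after 'ETH/USD'. The
% parentheses capture the number that follows the dollar sign.
expression = 'ETH/USD.*?class="price".*?\$([\d.,]+)';
tok = regexp(strC, expression, 'tokens', 'once');

% tok is a cell array holding the captured text; remove thousands
% separators before converting to a number.
price = str2double(strrep(tok{1}, ',', ''));
```

If the page layout changes (for example, if the class name is no longer "price"), tok comes back empty, so it is worth checking isempty(tok) before indexing into it.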

Related Solutions

MATLAB: How to automatically save and index data from an internet database at a specified interval

Most languages will allow you to extract data from the internet. Relevant questions include:

Where to get data? Is it free, historical, real time, reliable, etc?
Is there an API available or do you need to parse web pages by yourself?
How much time can you afford spending in writing your own parser?
Is it meaningful to build some data logger when historical data are available?
If you use MATLAB, can you afford having it dedicated to data logging?

I would personally use Python at least for the data logging part, as this language usually minimizes the time to solution (and you might prefer to invest your time in data analysis rather than in building a web crawler). There are plenty of Python libraries that will help you do almost everything (I have seen many threads about that, even though I have not worked on it myself). More importantly, I could not afford to have MATLAB tied up with data extraction and logging for a significant part of the day, every day.

Now if you just want to play a little in MATLAB to see what you can do, it is not too difficult to build some simple code for extracting and logging data. Try the following, for example:

Open http://www.google.com/finance?q=AAPL to get the Apple quote. 'AAPL' is the stock symbol, and you can see that it appears in the URL. Open the source of the webpage (CTRL+u in Firefox) and look up the price (431.72 as I write). You'll find it at a place that looks like this (with different numbers)

 values:["AAPL","Apple Inc.","431.72","+1.14","chg","0.26"

which is probably a good chunk of string for pattern matching (because it is close to the stock symbol).

Now in MATLAB, do the following:

 >> stockSymbol = 'AAPL' ;
 >> buffer = urlread(['http://www.google.com/finance?q=', stockSymbol]) ;

If you look at the content of buffer, you'll recognize the source code of the web page. So at this point you want to extract the quote based on pattern matching. You can achieve this with a regexp:

 >> pattern = ['values:\["', stockSymbol, '",".*?","(?<price>[\d\.]*?)","(?<change>[+\-\d\.]*?)".*?"(?<percent>[+\-\d\.]*?)"'] ;
 >> quote = regexp(buffer, pattern, 'names')
 quote = 
      price: '431.72'
     change: '+1.14'
    percent: '0.26'

and voila! Then you can convert to double, store in a file, or anything else. I could describe the pattern in more detail, but for now let's say that it is defined so that it matches a static literal like "values:[", then another literal that is the stock symbol, and then the three numbers framed by double quotes, commas, etc. Each part meant to match a number (including special characters like +-. when relevant) is saved as a named token. These token names are used to define the fields of the struct that regexp returns.
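
The conversion step can be sketched on the sample line from the page source, so that it runs without a network connection. The buffer below is just that example line standing in for the urlread output, and the bracket and hyphen are escaped in the pattern so they cannot be read as the start of a character class or as a range.

```matlab
% Stand-in for the relevant chunk of the page source (the example line
% shown earlier); on a live page, buffer would come from urlread.
buffer = 'values:["AAPL","Apple Inc.","431.72","+1.14","chg","0.26"';
stockSymbol = 'AAPL';

% Literal prefix, the stock symbol, a skipped field (the company name),
% then three named tokens for the numbers.
pattern = ['values:\["', stockSymbol, '",".*?","(?<price>[\d.]*?)",', ...
           '"(?<change>[+\-\d.]*?)".*?"(?<percent>[+\-\d.]*?)"'];
quote = regexp(buffer, pattern, 'names');

% The named tokens are character vectors; str2double turns them into numbers.
price   = str2double(quote.price);
change  = str2double(quote.change);
percent = str2double(quote.percent);
```

str2double returns NaN for text it cannot parse, which makes a failed extraction easy to detect with isnan.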

Wrapping it all into a small function, you get:

 function quote = getQuote_google(stockSymbol)
    buffer  = urlread(['http://www.google.com/finance?q=', stockSymbol]) ;
    pattern = ['values:\["', stockSymbol, '",".*?","(?<price>[\d\.]*?)","(?<change>[+\-\d\.]*?)".*?"(?<percent>[+\-\d\.]*?)"'] ;
    quote   = regexp(buffer, pattern, 'names') ;
 end

that you can then easily use as follows:

 >> quote = getQuote_google('AAPL')
 quote = 
      price: '431.72'
     change: '+1.14'
    percent: '0.26'
 >> quote = getQuote_google('GOOG')
 quote = 
      price: '831.52'
     change: '-1.08'
    percent: '-0.13'
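
To keep a history of quotes, each result can be appended to a text file together with a timestamp. This is a minimal sketch, assuming a struct with the price, change and percent fields shown above; the struct is filled in by hand here so the example runs offline, and the file name is purely illustrative.

```matlab
% Hypothetical quote struct, shaped like the output of getQuote_google.
quote = struct('price', '431.72', 'change', '+1.14', 'percent', '0.26');
logFile = fullfile(tempdir, 'quotes_AAPL.csv');   % illustrative file name

% Open in append mode so earlier entries are kept.
fid = fopen(logFile, 'a');
if fid < 0
    error('Could not open %s for writing.', logFile);
end

% Write one line: timestamp, price, change, percent.
fprintf(fid, '%s,%s,%s,%s\n', datestr(now, 'yyyy-mm-dd HH:MM:SS'), ...
        quote.price, quote.change, quote.percent);
fclose(fid);
```

Running these lines from a loop with pause, or from a timer object, would be one way to log at a fixed interval.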

MATLAB: I need to find video tag??

There is no "video" tag in what you show.

Are you looking for

<div id ="video-area-wrapper">

strcmp() and regexp() come to mind.
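
Both a literal search and a pattern search can be sketched on a small stand-in string. The page fragment below is made up for illustration, apart from the div shown above. strfind is used for the literal search here rather than strcmp, because strcmp compares whole strings and would not locate a substring.

```matlab
% Made-up page fragment containing the div from above.
nl = char(10); %#ok<CHARTEN>
html = ['<body>' nl '<div id ="video-area-wrapper">' nl ...
        '<p>player goes here</p>' nl '</div>' nl '</body>'];

% Literal search: strfind returns the start index of every exact occurrence.
idxLiteral = strfind(html, '<div id ="video-area-wrapper">');

% Pattern search: \s* tolerates any spacing around the equals sign.
[idxRegexp, idxEnd] = regexp(html, '<div\s+id\s*=\s*"video-area-wrapper"\s*>', 'once');
hasVideoDiv = ~isempty(idxRegexp);
```

The regexp version keeps working if the spacing inside the tag changes, while the literal search only matches the exact text.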
