MATLAB: How to set regexp so that it stops at the first instance

Hi all,

I need to extract the URLs from the following HTML code, and I am using regexp.

a='<option value="http://www.tractordata.com/farm-tractors/003/9/0/3906-massey-ferguson-7465-transmission.html">2004-2007</option><option value="http://www.tractordata.com/farm-tractors/006/7/0/6706-massey-ferguson-7465-transmission.html" selected>2008-2012</option></select></form></td></tr>';
urls=regexp(a,'(?<=option value.*)http.*html','match');

and the result is:

http://www.tractordata.com/farm-tractors/003/9/0/3906-massey-ferguson-7465-transmission.html">2004-2007</option><option value="http://www.tractordata.com/farm-tractors/006/7/0/6706-massey-ferguson-7465-transmission.html

As you can see, the extracted string respects the pattern, but it spans two different URLs. I need the following two results:

http://www.tractordata.com/farm-tractors/003/9/0/3906-massey-ferguson-7465-transmission.html
http://www.tractordata.com/farm-tractors/006/7/0/6706-massey-ferguson-7465-transmission.html

How may I fix this problem?

Thanks

Pietro

Best Answer

You could use a lazy quantifier ? (explained in the regular expression documentation):

>> urls = regexp(a,'(?<=option value.*)http.*?\.html','match');
>> urls{:}
ans =
http://www.tractordata.com/farm-tractors/003/9/0/3906-massey-ferguson-7465-transmission.html
ans =
http://www.tractordata.com/farm-tractors/006/7/0/6706-massey-ferguson-7465-transmission.html
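To see why the lazy quantifier makes a difference, it may help to compare greedy and lazy matching on a much shorter string. The input `s` below is a made-up example, not taken from the question; it is only a minimal sketch of the behaviour:

```matlab
% Hypothetical short input, used only to illustrate greedy vs. lazy matching.
s = '<b>one</b><b>two</b>';

% Greedy: .* consumes as much as possible, so the match runs to the LAST </b>.
greedy = regexp(s, '<b>.*</b>', 'match');

% Lazy: .*? consumes as little as possible, so each match stops at the FIRST </b>.
lazy = regexp(s, '<b>.*?</b>', 'match');

disp(numel(greedy))   % number of greedy matches
disp(numel(lazy))     % number of lazy matches
```

The greedy pattern returns a single match covering the whole string, while the lazy pattern returns one match per tag pair. The same thing happens in the question: `http.*html` runs on to the last `html` in the input.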

A more robust method is to exclude " characters from the match, so that it cannot run past the closing quote of the attribute value:

>> urls = regexp(a,'(?<=option value=")[^"]+\.html','match');
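As a possible variation on the same idea, the URL can be captured with a token instead of a lookbehind. This is only a sketch: the shortened `example.com` URLs are placeholders, not the ones from the question, and it assumes every URL of interest sits in a double-quoted `option value` attribute ending in `.html`.

```matlab
% Hypothetical input with placeholder URLs (not the original data).
a = ['<option value="http://example.com/a.html">2004-2007</option>' ...
     '<option value="http://example.com/b.html" selected>2008-2012</option>'];

% Capture everything between the quotes, as long as it contains no " and ends in .html.
tok = regexp(a, 'option value="([^"]+\.html)"', 'tokens');

% regexp returns one 1x1 cell per match; flatten them into a single cell array.
urls = [tok{:}];

disp(urls')
```

Because `[^"]` can never cross a quote, each match is confined to a single attribute value regardless of how many `option` elements follow.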

If you want to experiment with regular expressions, you might like to try my Interactive Regular Expression Tool, which shows the outputs of regexp as you type the parse and match strings. You can download it here:

https://www.mathworks.com/matlabcentral/fileexchange/48930-interactive-regular-expression-maker
