MATLAB: How to select specific urls in a webpage with regexp

MATLAB, regular expression

Hi all,

I'm doing some web scraping from this website. I need to extract the tractor links, which appear in many lines similar to the following one:

<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>

so after the link there is a string matching '\d* hp'. Here is the code I use to detect them:

url='http://www.tractordata.com/farm-tractors/tractor-brands/johndeere/johndeere-tractors.html';
html=urlread(url);
hyperlinks = regexp(html,'(?<=<tr><td.*>)<a.*?/a>(?=.*{8,50}\d* hp</td>)','match');

This code works fairly well, but I can't get rid of the first, wrong result, which is:

<a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>
<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a>

As you can see, it starts above the link that should be selected. How can I solve this? Thanks

Best Answer

Note: avoid greedy .*, particularly in complex expressions; it's bound to cause you problems. Negated character classes often work better. For example, instead of <td.*>, use <td[^>]*>.
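To make the difference concrete, here is a minimal sketch on a made-up one-line string (the sample is an assumption, not taken from the real page). The greedy version runs to the last > in the input, while the negated class stops at the first one:

```matlab
% Hypothetical sample string, not taken from the real page
s = '<td class="a">x</td><td class="b">y</td>';

% Greedy: .* grabs as much as possible, so the match ends at the LAST '>'
greedy = regexp(s, '<td.*>', 'match', 'once');

% Negated class: [^>]* cannot cross a '>', so the match ends at the FIRST '>'
negated = regexp(s, '<td[^>]*>', 'match', 'once');

disp(greedy)    % the whole string s
disp(negated)   % <td class="a">
```

As far as I know, . in MATLAB also matches newline characters by default, so on a full page a greedy .* can run across many lines.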

As per Michael's comment, your posted regex does not work. But even with the simplified regex:

hyperlinks = regexp(html, '(?<=<tr><td[^>]*>)<a.*?/a>', 'match')' %transposed for easy viewing in command window

you can see that there is a problem. Unfortunately for you, the problem is the webpage itself, which is not valid HTML. Your whole problem comes from the fact that the spacer.gif <a> hyperlink (on line 131 of the source HTML) is never closed. So of course, your regex captures everything up to the next </a>, which belongs to the next <tr><td>.
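The effect can be reproduced on a small fragment. This is only a sketch: the two rows below are a shortened, hypothetical imitation of the page, with the first anchor deliberately left unclosed:

```matlab
% Hypothetical two-row fragment; the first <a> has no closing </a>
html = ['<tr><td><a href="spacer.gif" alt=""></td></tr>' char(10) ...
        '<tr><td><a href="20a.html">20A</a></td><td>21 hp</td></tr>'];

% Simplified regex: lazy .*? still has to run until it finds some '/a>'
m = regexp(html, '(?<=<tr><td[^>]*>)<a.*?/a>', 'match');

disp(numel(m))  % 1: both rows are swallowed into a single match
disp(m{1})      % starts at the spacer.gif anchor, ends at 20A</a>
```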

Unfortunately that makes your life rather difficult. Try:

 hyperlinks = regexp(html, '(?<=<tr><td[^>]*>)<a[^>]*>[^<]*</a>(?=</td><td[^>]*>\d+ hp</td>)', 'match')' %transposed for easy viewing in command window

And if you can, report to the website owner that their page is missing a closing tag.
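As a quick check of the stricter pattern, here is a sketch on a hypothetical fragment (the example.com URLs and the variable names pat, links, tok and urls are made up for illustration). Because [^>]* and [^<]* cannot cross tag boundaries, the unclosed spacer.gif anchor no longer matches. A second regexp call with 'tokens' then pulls out just the href value:

```matlab
% Hypothetical fragment imitating the page; first <a> is never closed
html = ['<tr><td><a href="http://example.com/spacer.gif" alt=""></td></tr>' char(10) ...
        '<tr><td><a href="http://example.com/20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>'];

% Pattern from the answer: negated classes keep each piece inside one tag
pat = '(?<=<tr><td[^>]*>)<a[^>]*>[^<]*</a>(?=</td><td[^>]*>\d+ hp</td>)';
links = regexp(html, pat, 'match');

% Extract the href value from each matched anchor
tok  = regexp(links, 'href="([^"]+)"', 'tokens', 'once');
urls = cellfun(@(t) t{1}, tok, 'UniformOutput', false);

disp(links{1})  % <a href="http://example.com/20a.html">20A</a>
disp(urls{1})   % http://example.com/20a.html
```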

Related Solutions

MATLAB: How to set regexp so that it stops at the first instance

You could use a lazy quantifier ? (explained in the regular expression documentation):

>> urls = regexp(a,'(?<=option value.*)http.*?\.html','match');
>> urls{:}
ans =
http://www.tractordata.com/farm-tractors/003/9/0/3906-massey-ferguson-7465-transmission.html
ans =
http://www.tractordata.com/farm-tractors/006/7/0/6706-massey-ferguson-7465-transmission.html

A more robust method would be to not match " characters:

>> urls = regexp(a,'(?<=option value=")[^"]+\.html','match');
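A small sketch of why the second form is more robust, using a made-up input in which the first option does not end in .html (the example.com URLs are assumptions). The lazy pattern then runs across the closing quote into the next option, while the negated class cannot:

```matlab
% Hypothetical input: first option is a .php link, second is a .html link
a = ['<option value="http://example.com/a.php">A</option>' ...
     '<option value="http://example.com/b.html">B</option>'];

% Lazy: starts at the first 'http' and keeps going until some '.html'
lazyUrls = regexp(a, '(?<=option value.*)http.*?\.html', 'match');

% Negated class: [^"]+ cannot leave the quoted attribute value
strictUrls = regexp(a, '(?<=option value=")[^"]+\.html', 'match');

disp(lazyUrls{1})    % spans from a.php through to b.html
disp(strictUrls{1})  % http://example.com/b.html
```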

If you want to experiment with regular expressions then you might like to try my Interactive Regular Expression Tool, which shows the outputs of regexp as you type the parse and match strings. You can download it here:

https://www.mathworks.com/matlabcentral/fileexchange/48930-interactive-regular-expression-maker

MATLAB: How to programmatically add an annotation/note with a hyperlink to a Stateflow chart

This can be done by the following example code:

>> sfnew
>> rt = sfroot;
>> c = rt.find('-isa','Stateflow.Chart');
>> a = Stateflow.Annotation(c);
>> a.Interpretation = 'RICH';
>> a.Text = '<html><body><a href="www.mathworks.com">try this</a></body></html>'

In older releases, which use Stateflow.Note instead of Stateflow.Annotation, you can accomplish the same thing; you just have to use

>> a = Stateflow.Note(c);

for the constructor.

If you want to add MATLAB code to the hyperlink, modify the HTML code assigned to a.Text above.

Here you can find how to do this:

https://mathworks.com/help/matlab/matlab_prog/create-hyperlinks-that-run-functions.html
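As a rough sketch of what such a modified string could look like, the linked page describes the matlab: prefix for hyperlinks that run code. The disp command and the variable name linkText below are placeholders, and I am assuming (not verifying) that the annotation handles the link the same way the Command Window does:

```matlab
% Placeholder command: swap disp('clicked') for the code to run on click.
% Single quotes inside a char array are doubled ('').
linkText = '<html><body><a href="matlab:disp(''clicked'')">run this</a></body></html>';

disp(linkText)
% With Stateflow available and 'a' created as above, this would be assigned as:
%   a.Text = linkText;
```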
