Hi all,
I'm scraping this website and need to extract the tractor links. They can be recognized because they appear in many rows like the following one:
<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>
So each link is followed by a string matching '\d* hp'. Here is the code I use to detect them:
url = 'http://www.tractordata.com/farm-tractors/tractor-brands/johndeere/johndeere-tractors.html';
html = urlread(url);
hyperlinks = regexp(html,'(?<=<tr><td.*>)<a.*?/a>(?=.*{8,50}\d* hp</td>)','match');
This code works fairly well, but I can't get rid of the first result, which is wrong:
<a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr><tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a>
As you can see, the match starts before the link that should be selected. How can I fix this? Thanks
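For reference, the behaviour seems to come from the fact that a lazy `.*?` only limits where a match ends: the match still begins at the first `<a` that satisfies the lookbehind, and `.` is free to run across `</td></tr>` into the next row. The following is only a rough sketch of one way around that, assuming every wanted row has exactly the shape of the sample row above (a single `href` attribute, link text without tags, and an integer hp value). It replaces the lookarounds with negated character classes and a capture group; the `html` string here is a stand-in built from the two fragments quoted in this post, not the live page.

```matlab
% Stand-in input: the spacer fragment and the sample row quoted in the post.
html = ['<tr><td><a href="http://www.tractordata.com/spacer.gif" height="1" width="1" alt=""></td></tr>' ...
        '<tr><td><a href="http://www.tractordata.com/farm-tractors/005/4/6/5460-john-deere-20a.html">20A</a></td><td>21 hp</td><td>2008 - 2011</td></tr>'];

% [^"]* cannot leave the href value and [^<]* cannot cross a tag,
% so the match can no longer start in one row and end in the next.
% Assumption: the hp cell holds an integer followed by ' hp'.
pattern = '<tr><td>(<a href="[^"]*">[^<]*</a>)</td><td>\d+ hp</td>';

% 'tokens' returns one cell per match, each holding the captured <a>...</a>.
tokens = regexp(html, pattern, 'tokens');

% Flatten the nested cells into a plain cell array of link strings.
hyperlinks = cellfun(@(t) t{1}, tokens, 'UniformOutput', false);

disp(hyperlinks{1})
```

If the real page has extra attributes on the link or decimal hp values, the pattern would need to be loosened accordingly.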
Best Answer