MATLAB: HTML Page source info

Hello, we often come across a series of numbered webpages

basePage.html?page=2
basePage.html?page=3

and so forth, wherein there are several fields identified by their labels:

<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>

and so on.

How can the "textOfInterest" of one particular parameter, say Parameter2, be collected for every Name* on every page,

basePage.html?page=1toInf

and exported to a single text file, say Parameter2.txt?

The "textOfInterest" is often alphanumeric and may also contain special characters such as !@#$%.

Thanks.

% d is the page source as a char vector
close_div=strfind(d,'</div>');
param=1;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param);
position=strfind(d,pat);
position=position+numel(pat);%this will be the start of your text of interest
texts=cell(size(position));
for n=1:numel(position)
    % first closing div at or after the start of the text (>= also handles empty text)
    end_of_text=close_div(close_div>=position(n));
    end_of_text=end_of_text(1)-1;
    texts{n}=d(position(n):end_of_text);
end

d=['<h2 class="category-heading">Name1</h2>'...
    '<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
    '<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
    '<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
    '<h2 class="category-heading">Name2</h2>'...
    '<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
    '<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
    '<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
    '<h2 class="category-heading">Name3</h2>'...
    '<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
    '<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
    '<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
    ' : </label> <div class="category-related">',...
    '(',... % use parentheses to capture a token
    '[^<]*',... % this matches any number of characters other than <
    ')',...
    '</div>'];
t=regexp(d,RE,'tokens');
clc
celldisp(t)
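To go from the single-string examples above to the original question (one parameter, all pages, one text file), a rough sketch could look like the following. It is only an illustration under several assumptions: the pages are simulated by an in-memory cell array `pages` with made-up content, `fetchPage` is a hypothetical helper, the URL in the comment is a placeholder, and the loop assumes the series ends at the first page that contains no match for the chosen parameter.

```matlab
% Stand-in for the real site: in-memory "pages" with hypothetical content.
% For live pages, something like
%   fetchPage = @(k) webread(sprintf('http://example.com/basePage.html?page=%d', k));
% could replace the stub below (placeholder URL; webread may throw an error
% for a missing page, so a try/catch around the call may be needed).
row   = '<label>Parameter%d : </label> <div class="category-related">%s</div>';
pages = {[sprintf(row, 1, 'a1') sprintf(row, 2, 'x!@#') sprintf(row, 2, 'y$%')], ...
         sprintf(row, 2, 'z9'), ...
         ''};                       % empty page marks the end of the series
fetchPage = @(k) pages{k};

param = 2;                          % which ParameterN to collect
pat = sprintf(['<label>Parameter%d : </label> ' ...
               '<div class="category-related">([^<]*)</div>'], param);

texts = {};
k = 1;
while true
    html = fetchPage(k);
    tok  = regexp(html, pat, 'tokens');      % cell array of 1x1 cells
    if isempty(tok)
        break;                               % assumed: no match means no more pages
    end
    texts = [texts, cellfun(@(c) c{1}, tok, 'UniformOutput', false)];
    k = k + 1;
end

% Write one entry per line. The text is passed as an argument to '%s',
% so characters such as % or \ in it are not interpreted as format codes.
outFile = sprintf('Parameter%d.txt', param);
fid = fopen(outFile, 'w');
if ~isempty(texts)
    fprintf(fid, '%s\n', texts{:});
end
fclose(fid);
```

If a page in the middle of the series can legitimately lack the parameter, the stopping rule would need to change, for example to a known page count.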
