MATLAB: Selecting parts of HTML file

file processing, html, text file

Hello,
I have a series of serially numbered text files in a directory: Part 1, Part 2, Part 3, … These are actually HTML files, but I can also save them as text files. One such file, Part 4, is attached. (The '3' at the beginning of the title is part of the title, not the serial number; it occurs in every file in the directory.) All of these files have exactly the same structure. The region of interest always runs from line 26 to line 40.
I wish to save all the vernacular text from lines 26 to 40 in one text file, and the words in brackets immediately following each vernacular word in another text file. The vernacular text always comes after the serial number, a full stop, a space, and an asterisk. The words that follow always appear within opening and closing brackets, separated from the vernacular text by a space.
How can I extract these into two separate text files for all of the HTML files in the directory at once?
Thanks.

Best Answer

First, read the HTML files. You can get my readfile function from the FEX (on R2017a or later you can also get it through the Add-On Manager); alternatively, on R2020b or later you can use readlines:
data=readfile('https://www.mathworks.com/matlabcentral/answers/uploaded_files/424138/3%20letter%20Hindi%20words%20without%20matra%20%E2%80%93%20Part%204%20%E2%80%93%20Kathakar.txt');
lines_of_interest=data(26:40);
What you need to do next is parse those specific lines. You already have the patterns you're looking for. There is an optimal way with a regular expression, and an easy way with several calls to strfind. If you have trouble implementing either, don't hesitate to post a comment with what you tried.
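As a rough sketch of the regular-expression route, assuming each line of interest looks like "1. *word* (gloss)": the sample lines, the pattern, and the output file names below are illustrative assumptions, not taken from the attached file, so the pattern will probably need adjusting to the real markup.

```matlab
% Hypothetical sample standing in for data(26:40); the real lines may differ.
lines_of_interest = {'1. *कलम* (pen)'; '2. *नमक* (salt)'; '3. *शहर* (city)'};

% Serial number, full stop, space, asterisk, then two captured tokens:
%   token 1: the vernacular text (up to an optional closing asterisk)
%   token 2: the text inside the brackets that follow
pattern = '\d+\.\s\*\s*([^*(]+?)\s*\*?\s*\(([^)]*)\)';

% With a cell array input, 'tokens' + 'once' gives one cell of tokens per line.
tok = regexp(lines_of_interest, pattern, 'tokens', 'once');
tok = tok(~cellfun(@isempty, tok));          % drop lines that did not match

words   = cellfun(@(t) t{1}, tok, 'UniformOutput', false);
glosses = cellfun(@(t) t{2}, tok, 'UniformOutput', false);

% Write each list to its own UTF-8 text file (file names are placeholders).
fid = fopen('vernacular.txt', 'w', 'n', 'UTF-8');
fprintf(fid, '%s\n', words{:});
fclose(fid);

fid = fopen('bracketed.txt', 'w', 'n', 'UTF-8');
fprintf(fid, '%s\n', glosses{:});
fclose(fid);
```

To process every file in the directory, the same steps could be wrapped in a loop over the output of dir, appending to the two files (or to two cell arrays) on each pass.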