MATLAB: Extract strings between one repeated tag in a XML file

regexp, xml

Hi
I came across this XML file which looks like this:
...
<names id="10" name="Name1" type="one">
<subname id="7" name="subname1"/>
<proname id="70" name="proname1"/>
<proname id="6" name="proname40"/>
</names>
<names id="28" name="Name2" type="three">
<subname id="69" name="subname17"/>
<subname id="41" name="subname62"/>
<subname id="72" name="subname2"/>
<proname id="66" name="proname4"/>
</names>
...
My question is: how can I extract the text between id="\d*" name="Name1" and </names>, given that I want to use the names (Name1, Name2, …) to look up the related information (subname and proname)?
I have tried the following expression, but it did not work and I don't know why:
regexp(Dat,'(?<=names id="\d+" name="Name1").*(?=<\/names>)','match')
I appreciate any help in advance.

Best Answer

Getting regular expressions to do what you want them to do can be quite a challenge. The most important thing is to read the documentation. And then once again for luck. And did I mention that understanding the documentation is very important?
For example, the * quantifier actually has three modes: greedy, lazy, and possessive. I am not going to explain these in detail, because that is what the documentation is for, but your blanket .* is greedy: it reads as many characters as it can and only backs off far enough for the rest of the expression to match. So your match runs from Name1 all the way to the last </names> in the file, which gives you far more text than you wanted. The solution is to read the documentation and search for the terms I just gave you.
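To make the greedy/lazy difference concrete, here is a minimal sketch on a toy string (the string is made up for illustration and is not taken from your file):

```matlab
% Toy input (an assumption for illustration): two tagged blocks in one string.
str = '<a>one</a><a>two</a>';

% Greedy: .* takes as much as it can, so the match runs to the LAST </a>.
greedy = regexp(str, '(?<=<a>).*(?=</a>)', 'match', 'once');

% Lazy: .*? takes as little as it can, so the match stops at the FIRST </a>.
lazy = regexp(str, '(?<=<a>).*?(?=</a>)', 'match', 'once');

disp(greedy)   % one</a><a>two
disp(lazy)     % one
```

The same thing happens with your XML: the greedy version swallows every block up to the final closing tag.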
For working with regular expressions you also might like to try my FEX submission MAKEREGEXP:
MAKEREGEXP creates a figure that lets you change a regular expression interactively and immediately see the effect on the REGEXP outputs, which makes developing a regular expression a lot easier. I used this tool to come up with a match expression that might do what you want:
(?<=names id="\d+" name="Name1").*?(?=</names>)
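As a rough sketch of how that expression could be used, assuming Dat holds the file contents as one char vector (here it is built from the sample in the question; the variable names target, block, sub and pro are my own):

```matlab
% Sample data copied from the question. In practice Dat would hold the
% whole file as a single char vector.
Dat = sprintf('%s\n', ...
    '<names id="10" name="Name1" type="one">', ...
    '<subname id="7" name="subname1"/>', ...
    '<proname id="70" name="proname1"/>', ...
    '<proname id="6" name="proname40"/>', ...
    '</names>', ...
    '<names id="28" name="Name2" type="three">', ...
    '<subname id="69" name="subname17"/>', ...
    '<subname id="41" name="subname62"/>', ...
    '<subname id="72" name="subname2"/>', ...
    '<proname id="66" name="proname4"/>', ...
    '</names>');

target = 'Name1';   % the block to look up
% Escape the name in case it contains regexp metacharacters.
expr = ['(?<=names id="\d+" name="', regexptranslate('escape', target), ...
        '").*?(?=</names>)'];
block = regexp(Dat, expr, 'match', 'once');

% Pull the name attribute out of each child element inside that block.
sub = regexp(block, '<subname[^>]*name="([^"]*)"', 'tokens');
pro = regexp(block, '<proname[^>]*name="([^"]*)"', 'tokens');
sub = [sub{:}];   % flatten the nested cells into one cell array
pro = [pro{:}];
```

For the sample above this should leave sub holding subname1 and pro holding proname1 and proname40. Note that the match itself still starts with the type="one"> remainder of the opening tag, which is why the child names are pulled out in a second step.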
Or you could read each piece of info into its own token, and match every NameXXX:
(?<=<names)\s+id="(\d+)"\s+name="(\w+)"\s+type="(\w+)">(.*?)(?=</names>)
(You would need to return the tokens, not the match, for example with the 'tokens' output option.)
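A sketch of the token approach, again on the sample data from the question (the variable names and the second regexp that splits the body into child elements are my own additions, not something you have to use):

```matlab
% Sample data copied from the question.
Dat = sprintf('%s\n', ...
    '<names id="10" name="Name1" type="one">', ...
    '<subname id="7" name="subname1"/>', ...
    '<proname id="70" name="proname1"/>', ...
    '<proname id="6" name="proname40"/>', ...
    '</names>', ...
    '<names id="28" name="Name2" type="three">', ...
    '<subname id="69" name="subname17"/>', ...
    '<subname id="41" name="subname62"/>', ...
    '<subname id="72" name="subname2"/>', ...
    '<proname id="66" name="proname4"/>', ...
    '</names>');

% Four tokens per block: id, name, type, and the body between the tags.
% Note the closing parenthesis on the final lookahead.
expr = '(?<=<names)\s+id="(\d+)"\s+name="(\w+)"\s+type="(\w+)">(.*?)(?=</names>)';
tok = regexp(Dat, expr, 'tokens');   % one inner cell per <names> block

names = cell(1, numel(tok));
nKids = zeros(1, numel(tok));
for k = 1:numel(tok)
    t = tok{k};                      % {id, name, type, body}
    names{k} = t{2};
    % Split the body into {tag, name} pairs, one per child element.
    kids = regexp(t{4}, '<(\w+)[^>]*name="([^"]*)"', 'tokens');
    nKids(k) = numel(kids);
    fprintf('%s (id %s, type %s): %d child elements\n', ...
        t{2}, t{1}, t{3}, nKids(k));
end
```

Each element of tok is itself a cell array, so you can pick out the block you want by comparing names against Name1, Name2, and so on.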
Or even:
(?<=<names)\s+id="(\d+)"\s+name="(\w+)"\s+type="(\w+)">(\s*<.+?/>)*?\s*(?=</names>)