MATLAB: Extract strings between one repeated tag in an XML file

regexp xml

I came across this XML file which looks like this:

...
<names id="10" name="Name1" type="one">
<subname id="7" name="subname1"/>
<proname id="70" name="proname1"/>
<proname id="6" name="proname40"/>
</names>
<names id="28" name="Name2" type="three">
<subname id="69" name="subname17"/>
<subname id="41" name="subname62"/>
<subname id="72" name="subname2"/>
<proname id="66" name="proname4"/>
</names>
...

My question is: how can I extract the text between id="\d*" name="Name1" and </names>, given that I use the names (Name1, Name2, …) to look up the related information (subname and proname)?

I have tried the following expression, but it didn't work out and I don't know why!

regexp(Dat,'(?<=names id="\d+" name="Name1").*(?=<\/names>)','match')

I appreciate any help in advance.

Best Answer

Getting regular expressions to do what you want them to do can be quite a challenge. The most important thing is to read the documentation. And then once again for luck. And did I mention that understanding the documentation is very important?

For example, the * quantifier actually has three modes: greedy, lazy, and possessive. I am not going to explain what these are, because that is what the documentation is for, but your blanket .* is greedy: it reads as many characters as it can and only gives some back until the rest of the expression (here the </names> lookahead) matches. That means your match runs all the way to the last </names> in the file rather than stopping at the first one after Name1. The solution is to read the documentation and search for the terms I just gave you.
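As a rough illustration of the greedy versus lazy difference, here is a minimal sketch. The one-line sample text in Dat is made up for the example (it is not the asker's file), and the lookbehind is shortened to a fixed-length form to keep the sketch simple.

```matlab
% Hypothetical sample: two <names> blocks on a single line
Dat = ['<names id="10" name="Name1" type="one">' ...
       '<subname id="7" name="subname1"/></names>' ...
       '<names id="28" name="Name2" type="three">' ...
       '<proname id="66" name="proname4"/></names>'];

% Greedy: .* runs on to the LAST </names> in the text
greedy = regexp(Dat, '(?<=name="Name1").*(?=</names>)', 'match', 'once');

% Lazy: .*? stops at the FIRST </names> after the match start
lazy = regexp(Dat, '(?<=name="Name1").*?(?=</names>)', 'match', 'once');

disp(greedy)   % spills over into the Name2 block
disp(lazy)     % stays inside the Name1 block
```

The only change between the two calls is the extra ? after the *, which is what switches the quantifier from greedy to lazy.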

For working with regular expressions you also might like to try my FEX submission MAKEREGEXP:

http://www.mathworks.com/matlabcentral/fileexchange/48930-interactive-regular-expression-maker

MAKEREGEXP creates a figure which lets you interactively change your regular expression and immediately see what effect it has on the REGEXP outputs, giving immediate feedback: this makes developing a regular expression a lot easier. I used this tool to come up with a match expression that might do what you want:

(?<=names id="\d+" name="Name1").*?(?=</names>)

Or you could read each piece of info into its own token, and match every NameXXX:

(?<=<names)\s+id="(\d+)"\s+name="(\w+)"\s+type="(\w+)">(.*?)(?=</names>)

(you would need to return the tokens, not the match).
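To show what returning the tokens might look like, here is a small sketch built on the token expression above. The sample text in Dat and the second pattern (kidPat, which pulls the element type and name out of each child tag) are assumptions added for illustration, not part of the original answer.

```matlab
% Hypothetical sample modelled on the XML in the question
Dat = ['<names id="10" name="Name1" type="one">' ...
       '<subname id="7" name="subname1"/>' ...
       '<proname id="70" name="proname1"/>' ...
       '</names>' ...
       '<names id="28" name="Name2" type="three">' ...
       '<proname id="66" name="proname4"/>' ...
       '</names>'];

% Token expression from the answer: id, name, type, and block body
pat = '(?<=<names)\s+id="(\d+)"\s+name="(\w+)"\s+type="(\w+)">(.*?)(?=</names>)';
tok = regexp(Dat, pat, 'tokens');   % one cell of four tokens per <names> block

% Assumed helper pattern: element type and name attribute of each child tag
kidPat = '<(\w+)\s+id="\d+"\s+name="(\w+)"\s*/>';

for k = 1:numel(tok)
    blockName = tok{k}{2};          % e.g. Name1
    body      = tok{k}{4};          % everything between > and </names>
    kids = regexp(body, kidPat, 'tokens');
    for j = 1:numel(kids)
        fprintf('%s: %s = %s\n', blockName, kids{j}{1}, kids{j}{2});
    end
end
```

Splitting the work into an outer match per block and an inner match per child keeps each expression short, at the cost of a second regexp call per block.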

Or even:

 (?<=<names)\s+id="(\d+)"\s+name="(\w+)"\s+type="(\w+)">(\s*<.+?/>)*?\s*(?=</names>)

Related Solutions

MATLAB: Reading nth value in a line starting with a word

S = fileread('FilenameGoesHere.txt');
parts = regexp(S, 'lvac,\s+m/s\s+(?<col3>\S+)\s+(?<col4>\S+)', 'names', 'once');
col3 = str2double(parts.col3);
col4 = str2double(parts.col4);

This code does not assume integers, but it does assume that there are no stray characters immediately adjacent to the numbers. For example,

 lvac, m/s   3°  17.2µm

would fail.
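The failure mode can be reproduced with a short sketch. The sample line below is hypothetical and uses an ASCII "um" suffix in place of the symbols above; the more tolerant pattern is one possible workaround (an assumption, not part of the original answer) that captures only the numeric part of each field.

```matlab
% Hypothetical input line with a unit glued onto the second number
S = 'lvac, m/s   3.5  17.2um';

% Original pattern: \S+ swallows the trailing 'um' as well
parts = regexp(S, 'lvac,\s+m/s\s+(?<col3>\S+)\s+(?<col4>\S+)', 'names', 'once');
bad = str2double(parts.col4);   % NaN, because '17.2um' is not a plain number

% Assumed alternative: capture only the numeric part of each field
num = '[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?';
pat = ['lvac,\s+m/s\s+(?<col3>' num ')\D*?\s+(?<col4>' num ')'];
parts2 = regexp(S, pat, 'names', 'once');
col3 = str2double(parts2.col3);
col4 = str2double(parts2.col4);
```

Whether silently dropping the stray characters is acceptable depends on the data; returning NaN, as the original code does, at least makes the problem visible.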

MATLAB: Using regular expressions to identify names and unit from string

You could do it in two steps, like this:

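str = 'name1/name2 [C]';
%% Extract unit, if there is one
% A character class avoids unescaped brackets in the pattern
% (and the variable no longer shadows the built-in exp function)
expr = '[\[\]{}]';
id = regexp(str, expr);
if numel(id) >= 2
    unit = str(id(1)+1:id(2)-1)
end
%% Extract names
expr = '[/ ]';
id = regexp(str, expr);
if numel(id) == 2
    name1 = str(1:id(1)-1)
    name2 = str(id(1)+1:id(2)-1)
else
    name1 = str(1:id(1)-1)
    name2 = str(id(1)+1:end)
end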

If you do not want to save whatever is within the curly brackets as units, then remove those from the first expression.
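For comparison, the same split can be sketched in a single regexp call with named tokens. This is an alternative under the assumption that the string always has the form name1/name2 followed by an optional unit in square or curly brackets; the pattern and the token names are illustrative, not taken from the answer above.

```matlab
str = 'name1/name2 [C]';

% name1, a slash, name2, then an optional [unit] or {unit}
pat = ['^(?<name1>[^/\s]+)/(?<name2>[^/\s\[{]+)\s*' ...
       '(?:[\[{](?<unit>[^\]}]*)[\]}])?$'];
parts = regexp(str, pat, 'names', 'once');

name1 = parts.name1
name2 = parts.name2
unit  = parts.unit
```

The two-step version is easier to adapt piece by piece, while the single expression keeps all assumptions about the string layout in one place.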
