MATLAB: Help extracting numbers from table of mixed strings

Hi,
I am working with a table of mixed strings, and I need to extract all the numbers and output them into a separate table, or at least a double. Here's my table (4×1 in this case, but won't always be that size):
"|137> Nuclei = all Fe {aiso, adip, aorb, fgrad, rho} # aorb is for second order perturbation to HFC from SOC"
"|138> nuclei = 113 {aiso, adip, fgrad, rho} # carbon on cyanide"
"|139> nuclei = 5, 7, 17, 19, 29, 31, 114 {aiso, adip, fgrad, rho} # nitrogens connected by histidine and cyanide"
"|140> nuclei = 10, 11, 12, 22, 23, 24, 34, 35, 36, 79, 98, 101, 110, 111 {aiso, adip, fgrad, rho} # hydrogens from waters, histidines, tyrosine"
Based on what I've read, it seems regexp() is a good way to do this. However, I can't seem to get the expression right. I have tried numerous expressions, and I always get either 'nan' or '<missing>'. Does anyone think there is a better way to do this besides using regexp()? If not, can someone help me land on the right expression?
Here's a sample of what I've messed with:
for i = 1:height(table)
expression = '?<=='; %just trying to get anything following the '=' to start
new_table{1,i} = regexp(table{i,1}, expression, 'once', 'match'); %trying to get an index output or something here, can str2double() later on
end
This returns a 1×4 cell with <missing> in each cell. The end goal is to have each number as a row in a column (in this case, I would want a 23×1 double that contains 22 extracted numbers and a 'nan' for the "all Fe" case). I can work on formatting that later, but the first step is getting the expression right.
Any insight would be really appreciated!
Molly

Best Answer

Simpler with a regular expression. Here I used a cell array of char vectors, but it will also work with a string array.
str = {...
'|137> Nuclei = all Fe {aiso, adip, aorb, fgrad, rho} # aorb is for second order perturbation to HFC from SOC'
'|138> nuclei = 113 {aiso, adip, fgrad, rho} # carbon on cyanide'
'|139> nuclei = 5, 7, 17, 19, 29, 31, 114 {aiso, adip, fgrad, rho} # nitrogens connected by histidine and cyanide'
'|140> nuclei = 10, 11, 12, 22, 23, 24, 34, 35, 36, 79, 98, 101, 110, 111 {aiso, adip, fgrad, rho} # hydrogens from waters, histidines, tyrosine'};
rgx = '(?<==\s+)(\d+(,\s+\d+)*)?';
tmp = regexpi(str,rgx,'once','match');
fun = @(s)sscanf(s,'%f,',[1,Inf]);
out = cellfun(fun,tmp,'UniformOutput',false); % convert to cell array of double vectors

Giving:
>> out{:}
ans =
[]
ans =
113
ans =
5 7 17 19 29 31 114
ans =
10 11 12 22 23 24 34 35 36 79 98 101 110 111
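To get from this cell array to the single column described in the question (one number per row, with NaN standing in for a line that lists no numbers, such as "all Fe"), something like the following might do. It is only a sketch: the first statement is a hard-coded stand-in for the OUT shown above, and the names filled, isEmp and col are made up for illustration.

```matlab
% Stand-in for the OUT cell array shown above (normally produced by CELLFUN).
out = {[]; 113; [5 7 17 19 29 31 114]; ...
       [10 11 12 22 23 24 34 35 36 79 98 101 110 111]};

filled = out;                        % work on a copy, OUT itself stays unchanged
isEmp  = cellfun(@isempty, filled);  % true for lines that listed no numbers
filled(isEmp) = {NaN};               % placeholder for cases like "all Fe"
col = [filled{:}].';                 % join the row vectors, then make a column
```

For the four example lines this should give a 23×1 double whose first element is NaN.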
If you also want the leading (row?) number then use this:
rgx = '^\|(\d+)>\s+NUCLEI\s+=\s+(\d+(,\s+\d+)*)?'; % note the escaped \| (an unescaped | means alternation)
tkn = regexpi(str,rgx,'once','tokens'); % note: REGEXPI
fun = @(s)sscanf(s,'%f,',[1,Inf]); % exactly the same anonymous function as used above
out = cellfun(fun,vertcat(tkn{:}),'UniformOutput',false); % convert to cell array of double vectors
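
With two tokens per line, OUT should come back as an N×2 cell array: the leading number in the first column and the extracted values in the second. If you then want every extracted value paired with the line number it came from, one possible approach is sketched below. The first statement is a hard-coded stand-in for that OUT, the names vals, counts, labels and pairs are made up for illustration, and REPELEM needs R2015a or later.

```matlab
% Stand-in for the N-by-2 OUT described above: {line number, extracted values}.
out = {137, []; ...
       138, 113; ...
       139, [5 7 17 19 29 31 114]; ...
       140, [10 11 12 22 23 24 34 35 36 79 98 101 110 111]};

vals = out(:,2);
vals(cellfun(@isempty, vals)) = {NaN};  % placeholder where no numbers were listed
counts = cellfun(@numel, vals);         % how many values each line contributes
labels = repelem(cell2mat(out(:,1)), counts); % repeat each line number to match
labels = labels(:);                     % make sure it is a column
pairs  = [labels, [vals{:}].'];         % column 1: line number, column 2: value
```

Each row of pairs then holds one line number and one value, with NaN in the second column for a line that had no numbers.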