MATLAB: Regexp: what am I missing from the documentation

I have tried to read the regexp documentation carefully, and I am able to successfully implement regexp in the simplest cases. For example, given:

test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8'

I can use the following code to retrieve each of the separate names, without the trailing numeral and/or whitespace:

exp = '\w*[^1-9\s]';
MyMatch = regexp(test, exp, 'match')

MyMatch = 1×8 cell array

Columns 1 through 6

{'John'} {'Ron'} {'James'} {'Dongo'} {'Chloe'} {'Billgo'}

Columns 7 through 8

{'Marie'} {'Aaron'}

However, despite much effort, I cannot achieve a more complex result (example provided below). I try to limit the number of questions I post to the community, but in this case I would like to ask whether the experts can point out where I am erring in my use of regexp to obtain a (slightly more complex) result. Note that this is not a specific problem I am trying to solve. I merely invented a 'random' problem in an effort to become more adept in my use of regexp.

For the following example, assume that every name instance in the character vector test has one of two possible problems:

1. A single digit immediately follows the name (e.g., James7).
2. The name has 'go' appended to its end.

NB: We know in advance there are no name instances in test that would require us to consider the possibility that 'go' is just the natural ending of a name instance (e.g., Hugogo).

Thus, given the character vector:

test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8'

The desired output is:

MyMatch = 1×8 cell array

Columns 1 through 6

{'John'} {'Ron'} {'James'} {'Don'} {'Chloe'} {'Bill'}

Columns 7 through 8

{'Marie'} {'Aaron'}

Examples of attempted (and failed) solutions:

% Given the documentation's statement, "If you specify a lookahead assertion before an expression,
% the operation is equivalent to a logical AND."
MyMatch = regexp(test, '(?<=\w*[^*go\s)\w*[^1-9\s]', 'match') 
% Attempts to implement 'OR' logic: (exp|exp)
% (1)
[tok, mat] = regexp(test, '(\w+)([^*go\s]|[^1-9\s])', 'tokens', 'match');
vertcat(tok{:}) % then extract col1

% (2)
[tok, mat] = regexp(test, '((\w+)([^*go\s]))|((\w+)([^1-9\s]))', 'tokens', 'match')
vertcat(tok{:}) % then extract col1
% ...

And so on and so forth…
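
A note on the character classes used in these attempts, offered as a minimal sketch: inside square brackets, `[^*go\s]` stands for one character that is not `*`, `g`, `o`, or whitespace. It does not mean "not followed by the string go". The variable names `singleChars` and `mat1` are illustrative only.

```matlab
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8';

% A negated class matches exactly one character outside the listed set.
singleChars = regexp('Dongo', '[^*go\s]', 'match');   % the characters 'D' and 'n'

% In attempt (1), \w+ backtracks by one character and one of the two classes
% then accepts that last character, so every word is matched in full.
mat1 = regexp(test, '(\w+)([^*go\s]|[^1-9\s])', 'match');
isequal(mat1, strsplit(test))   % true: no suffix was removed
```

Because the final digit or letter is always accepted by one of the two classes, the alternation never strips anything from the match.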

1. What is your approach/solution (using regexp) to the above? Is it better to take a multipronged approach (e.g., convert to a cell array first, use two regexp calls, etc.)?
2. What is your approach/solution (using regexp) given:

test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo' % note Hugogo
% we want the 'MyMatch' or 'MyTokens' cell array to contain 'Hugo'

Thanks for your time, and Happy New Year!

Sincerely,

Ray

Best Answer

A direct translation of your description "assume that all name instances in a character vector test have one of two possible problems. 1. A single digit immediately follows the name (e.g., James7) 2. The name has 'go' appended to its end." into a regular expression uses a single lookahead assertion:

>> test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';
>> regexp(test,'\w+(?=(\d|go)\>)','match')
ans = 
    'John'    'Ron'    'James'    'Don'    'Chloe'    'Bill'    'Marie'    'Aaron'    'Hugo'
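
To see what the lookahead contributes, here is a small sketch that compares the same pattern with and without it. The variable names `withSuffix` and `namesOnly` are illustrative and not part of the original answer.

```matlab
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';

% Without the lookahead, the suffix (digit or 'go') is consumed, so it
% becomes part of each match.
withSuffix = regexp(test, '\w+(\d|go)\>', 'match');

% With the lookahead, the suffix must be present at the end of the word,
% but it is not consumed, so the match stops just before it.
namesOnly = regexp(test, '\w+(?=(\d|go)\>)', 'match');
```

Here `\>` anchors the suffix to the end of a word, which is why only the final 'go' of 'Hugogo' is treated as the suffix.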

Or, similarly, using a token for the name together with a non-capturing group for the suffix:

>> tkn = regexpi(test,'(\w+)(?:\d|go)\>','tokens');
>> [tkn{:}]
ans = 
    'John'    'Ron'    'James'    'Don'    'Chloe'    'Bill'    'Marie'    'Aaron'    'Hugo'
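
On the question of a multipronged approach, one possible sketch is to split the text into words first and then strip the suffix from each word with regexprep. This is an alternative to the patterns above, not a replacement for them, and the variable names `words` and `names` are illustrative.

```matlab
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';

% Step 1: split on whitespace to get one cell per word.
words = strsplit(test);

% Step 2: remove a single trailing digit or a trailing 'go' from each word.
% The $ anchor restricts the replacement to the end of each word.
names = regexprep(words, '(\d|go)$', '');
```

Like the lookahead version, this relies on the stated assumption that every name carries exactly one of the two suffixes; a bare 'Hugo' with no suffix would be shortened to 'Hu'.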
