MATLAB: Can nested quantifiers in REGEXP can cause inefficient failures in MATLAB 6.5 (R13)

consecutivehangletterlongMATLABquantifierregexpslow

If:
1. A regular expression contains a parenthesized subexpression which is quantified, and the inner expression is quantified as well (for example, '(.+)+') and
2. The text string contains a long series of matches of the subexpression and
3. The regular expression does not match the string,
it can take REGEXP a long time to determine that a match cannot be made. For example,
str = 'ltbwithupdatedbendsandmatchedoptics3/8/04,';
[s f t] = regexp(str,'^\d*(\w+\d*)*:,');
runs for several minutes without returning a result.

Best Answer

This behavior is inherent in the NFA algorithm for regular expressions, used by most systems that support regular expressions, such as Perl, Emacs, and grep. MATLAB uses an optimized NFA algorithm.
The computational complexity of this problem is exponential with the number of characters in the string that must be tested. Expressions that fit this pattern are, by their nature, redundant. When no match is possible, the engine must explore all exponentially-many possibilities before concluding there is no match.
These expressions simplified by removing the outer quantifier --- for example, using '(.+)' instead of '(.+)+'. By reducing the number of duplicate string expansions, you can reduce the work required when no match is possible without affecting which strings the expression matches.
To improve the performance of this regular expression, remove redundancy in the quantifiers. For example, in this command:
[s f t] = regexp(str,'^\d*(\w+\d*)*:,');
the regular expression can be simplified from:
'^\d*(\w+\d*)*:,'
to:
'^\d*(\w*):,'
The non-redundant quantifier produces the same results in parsing the expression as the redundant quantifier, but is much more efficient when no match exists.
Using this expression in the original example, REGEXP reports no matches in well under 1 second:
str = 'ltbwithupdatedbendsandmatchedoptics3/8/04,';
[s f t] = regexp(str,'^\d*(\w*):,');
For more information on the NFA algorithm and building efficient regular expressions, refer to a regular expression reference such as the one listed below.
References
1. Friedl, J.E.F. Mastering Regular Expressions, http://www.oreilly.com/catalog/regex/