Regular Expressions – Can Regular Expressions Be Made Unambiguous?

formal-languages

When investigating regular languages, regular expressions are obviously a useful characterisation, not least because they are amenable to nice inductions. On the other hand ambiguity can get in the way of some proofs.

Every regular language is recognized by an unambiguous context-free grammar (take a deterministic automaton which recognises it, and make a production $R \rightarrow tS$ for every edge $R \stackrel{t}{\rightarrow} S$ in the DFA, and $R \rightarrow \epsilon$ for every accepting state $R$).

On the other hand, the natural "grammar" for a regular language is its regular expression. Can these be made unambiguous?

To be precise, let's define a parse for a regular expression (this is I think a natural definition, but not one I've seen named before).

  • $x$ is an $x$-parse of $x$, if $x$ is a symbol or $x=\varepsilon$
  • $(y, 0)$ is an $R\cup R'$-parse of $x$, if $y$ is an $R$-parse of $x$
  • Similarly, $(y,1)$ is an $R\cup R'$-parse of $x$, if $y$ is an $R'$-parse of $x$
  • $(y_1, y_2)$ is an $RR'$-parse for $x_1x_2$, if $y_i$ is an $R$-parse for $x_i$ for $i=1,2$
  • $[]$ is an $R^*$-parse for $\varepsilon$
  • $[y_1, y_2, \dots, y_n]$ is an $R^*$-parse of $x_1x_2\cdots x_n$, if $y_i$ is an $R$-parse for $x_i$ for $1 \le i \le n$

In short, the parses of a string tell us how a regular expression matches a string if it does.

A regular expression $R$ is unambiguous if, for every $x \in L(R)$, there is only one $R$-parse of $x$.

Given a regular expression, is there an unambiguous regular expression which matches the same language?

Best Answer

There's a standard construction of a regular expression from a DFA: define an expression R(i,j,k) for the language of strings that take state i to state j of the DFA while using intermediate states that belong only to the subset of states from state 1 to state k, as follows.

  • R(i,j,0) is [xyz...] where x, y, z etc are the symbols that occur as labels of transitions from state i to state j (there can be no intermediate states). If there is no such transition then R(i,j,0)=0 (the expression for the empty language).
  • Similarly, R(i,i,0) is e + [xyz...] where e is the expression that represents the empty string.
  • For k > 0, R(i,j,k) is R(i,j,k-1) + R(i,k,k-1) R(k,k,k-1)* R(k,j,k-1). That is, any string that takes you from i to j using intermediate states up to k either goes from i to j without going through k, or can be parsed into a sequence of substrings that go from i to k, k back to itself zero or more times, and then k to j.
  • The regular expression for the whole language is then the sum of the expressions R(1,i,n) where i is one of the accepting states.

The first two of these rules obviously give you unambiguous expressions, the third is unambiguous because there's only one way of parsing the string into substrings as described, and the fourth is unambiguous because any given string can only go to a single accepting state. Therefore, the final regular expression constructed in this way is unambiguous.

Related Question