[Math] Size bound on regular expression describing language of an $n$-state deterministic automaton

computability, computer-science, formal-languages

The class of languages that can be recognised by some deterministic finite automaton is the same as the class of languages described by some regular expression.

I invoked this well-known fact in class when discussing finite automata, just after having given the example of the set of decimal strings representing a number divisible by $9$ as an easy example of a language recognised by a ($9$-state) deterministic finite automaton (DFA). But then it struck me that I would be totally incapable of writing a regular expression for the same language.

I have since been able to locate some methods that in principle will produce a regular expression from a finite automaton, but all of them seem to give rise, for the kind of automata of the example, to utterly unwieldy regular expressions in terms of the number of states. So my question is: what is the best known bound on the size of the shortest regular expression that describes the same language as a given $n$-state DFA? Or more particularly for the language of all strings over the alphabet $\{0,1,\ldots,n-1\}$ the sum of whose symbols is divisible by $n$, which I expect to be a worst-case example (since the transitions for all characters from any given state go to different states).
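To make the automaton concrete, here is a minimal Python sketch of the $n$-state DFA for the generalised language over $\{0,\ldots,n-1\}$ (the function names are mine, chosen for illustration): states are the residues of the running sum modulo $n$, and reading a symbol $d$ moves from state $s$ to state $(s+d)\bmod n$.

```python
# Sketch of the n-state DFA from the question (names are illustrative):
# states are the residues 0..n-1 of the running sum modulo n;
# state 0 is both the initial state and the only accepting state.

def divisibility_dfa(n):
    """Transition table over the alphabet {0, ..., n-1}:
    delta[state][symbol] = (state + symbol) % n."""
    return [[(s + d) % n for d in range(n)] for s in range(n)]

def accepts(delta, word, n):
    state = 0
    for d in word:
        state = delta[state][d]
    return state == 0  # accept iff the sum of the symbols is divisible by n

delta9 = divisibility_dfa(9)
assert accepts(delta9, [1, 8], 9)         # 1 + 8 = 9
assert not accepts(delta9, [1, 2, 4], 9)  # 1 + 2 + 4 = 7
```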

I have read in Wikipedia that for the opposite problem of finding a DFA for a given regular expression, the number of states necessary can be exponential in the length of the regular expression; however, I found nothing in the direction of my question.
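(For concreteness, the standard witness in that opposite direction: the language of words over $\{0,1\}$ whose $n$-th symbol from the end is a $1$ is described by the expression
$$
(0+1)^*\,1\,(0+1)^{n-1},
$$
whose size is linear in $n$, while a Myhill–Nerode argument shows that any DFA for it must remember the last $n$ symbols read, and hence needs at least $2^n$ states.)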

I found the following three methods described in this paper.

Kleene's original method involves numbering the states, then defining $R_{i,j}^k$ to be the language of strings that pass from state $i$ to state $j$ without ever passing through an intermediate state numbered ${}>k$, initialising all $R_{i,j}^0$ directly, and finally computing the others by induction on $k$.

Another, more graphical method consists of viewing the DFA as a generalised nondeterministic finite automaton, and then suppressing nodes (and combining edges) until just the initial and final states remain (it is not entirely clear to me how to handle multiple accepting states, since they should not be suppressed).

Finally, the method that I like best is an algebraic one, which views union and concatenation of regular languages as algebraic operations (addition and non-commutative multiplication) that satisfy a distributive law, introduces an unknown for each state (corresponding to the language accepted when starting from that state), and translates the state transitions into a set of $n$ right-linear equations (I am unsure whether this is the right terminology; in any case the unknowns only occur at the right end of terms). This set of equations is then solved using repeated substitution together with the rule that
$$
X = AX + B_1+\cdots+B_n
\quad\text{is solved by}\quad
X = A^*\,(B_1+\cdots+B_n)
$$
provided that $B_1,\ldots,B_n$ do not contain $X$ and $A$ does not produce the empty string. I have applied these methods to the instance $n=3$ of the example above, in the form of the language over $\{a,b,c\}$ of all strings in which the number of letters $a$ plus twice the number of letters $b$ is divisible by $3$. The results are (if I made no mistakes) the expressions

  • $c^*+c^*a(bc^*a)^*bc^*+(c^*b+c^*a(bc^*a)^*bc^*)(ac^*b+ac^*a(bc^*a)^*bc^*b)^*(c^*b+c^*a(bc^*a)^*bc^*b)$

for Kleene's method, and for the other two methods the identical regular expression

  • $(c+bc^*a+(a+bc^*b)(c+ac^*b)^*(b+ac^*a))^*$,

which is slightly better. However, neither expression makes it inviting to attack $n=9$ by hand right away; a mechanisation of the elimination method is sketched below.
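Here is a Python sketch of the state-elimination (GNFA) method described above, to attack larger $n$ mechanically (the naming is mine, and the output expression is not simplified at all). Adding a fresh initial state and a fresh final state, with $\varepsilon$-edges from every accepting state to the new final state, also answers the question of how to handle multiple accepting states: none of the original states needs to be kept.

```python
from itertools import product

EPS = "ε"  # explicit epsilon symbol, so ε survives unions and concatenations

def star(r):
    return EPS if r == EPS else "(" + r + ")*"

def cat(*parts):
    parts = [p for p in parts if p != EPS]  # ε is the identity for concatenation
    if not parts:
        return EPS
    # over-parenthesises a little, but stays correct
    return "".join("(" + p + ")" if "+" in p else p for p in parts)

def dfa_to_regex(states, delta, start, finals):
    """delta[state][letter] = next state; returns one (unsimplified) regex string."""
    s, f = object(), object()  # fresh initial and final states
    edge = {}                  # edge[(p, q)] = regex labelling the arc from p to q

    def add(p, q, r):          # union a new term into an existing edge
        edge[(p, q)] = r if (p, q) not in edge else edge[(p, q)] + "+" + r

    for p in states:
        for a, q in delta[p].items():
            add(p, q, a)
    add(s, start, EPS)         # ε-edge into the old initial state
    for q in finals:           # ε-edge out of every accepting state: this is how
        add(q, f, EPS)         # multiple accepting states are accommodated

    for k in states:           # eliminate the original states one by one
        # with no self-loop the loop language is empty, and (empty)* = ε
        loop = star(edge.pop((k, k), EPS))
        ins = [(p, r) for (p, q), r in edge.items() if q == k]
        outs = [(q, r) for (p, q), r in edge.items() if p == k]
        for p, _ in ins:
            edge.pop((p, k))
        for q, _ in outs:
            edge.pop((k, q))
        for (p, rin), (q, rout) in product(ins, outs):
            add(p, q, cat(rin, loop, rout))

    return edge.get((s, f), "∅")  # ∅ if no accepting path exists at all

# The n = 3 instance from above: a counts 1, b counts 2, c counts 0 modulo 3.
delta3 = {0: {"a": 1, "b": 2, "c": 0},
          1: {"a": 2, "b": 0, "c": 1},
          2: {"a": 0, "b": 1, "c": 2}}
print(dfa_to_regex([0, 1, 2], delta3, start=0, finals=[0]))
```

For $n=3$ this produces an expression equivalent to the second one above, only with more parentheses; the size of the output grows very quickly with $n$, which is exactly the phenomenon the question asks about.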

Added: I found a lot of interesting information in this article by Jacques Sakarovitch that was referred to in a comment of the cstheory.SE thread indicated in the comment by Raphael. Notably it states: "It is easily seen that the size of a regular expression $E$ computed from an automaton $\mathcal A$ may be exponential in the number of states of $\mathcal A$. A complete graph shows that this combinatorial explosion is unavoidable." If anyone could give a proof of the latter claim, I think that would settle my question.

Best Answer

Gregor Gramlich, Georg Schnitger: Minimizing NFA's and Regular Expressions suggests that minimising regular expressions is hard, so I guess there is no known fast algorithm to minimise them (and, consequently, no conversion from automata that is guaranteed to lead to minimal or small expressions).
