[Math] Converting Context Free Grammar to Chomsky Normal Form

context-free-grammarformal-grammarformal-languages

This is an exercise that I had to complete in my class and I struggled a lot with it

$$\begin{align*}
&S\to 0A0\mid 1B1\mid BB\\
&A\to C\\
&B\to S\mid A\\
&C\to S\mid\epsilon
\end{align*}$$

Added from comments: This is what I have but I am sure it is wrong: replace epsilon productions:

$$\begin{align*}
&S\to 0A0\mid 1B1\mid BB\\
&A\to C\\
&B\to S\mid A\\
&C\to S
\end{align*}$$

replace unit productions:

$$\begin{align*}
&S\to 0A0\mid 1B1\mid BB\\
&A\to C\mid 0A0\mid 1B1\mid BB\\
&B\to S\mid A\\
&C\to S\mid 0A0\mid 1B1\mid BB
\end{align*}$$

replace mixed productions:

$$\begin{align*}
&S\to T0AT0\mid T1BT1\mid BB\\
&A\to C\mid T0AT0\mid T1BT1\mid BB\\
&B\to S\mid A\\
&C\to S\mid T0AT0\mid T1BT1\mid BB
\end{align*}$$

Best Answer

One way to see that it has to be wrong is to notice that there are no terminal productions, while the original grammar allows the derivation of $00$:

$$S\Rightarrow 0A0\Rightarrow 0C0\Rightarrow 00\;.$$

I will more or less follow this algorithm, which may differ slightly from the one that you’ve been shown, to convert the original grammar to Chomsky normal form.

Let $T$ be a new initial symbol, and add the production $T\to S$. The next step is to eliminate the $\epsilon$ productions; the only one of these is $C\to\epsilon$, which I remove from the grammar. The only production with $C$ on the righthand side is $A\to C$; adding a new production with $C$ replaced by $\epsilon$ gives us the following set of productions:

$$\begin{align*} &T\to S\\ &S\to 0A0\mid 1B1\mid BB\\ &A\to C\mid\epsilon\\ &B\to S\mid A\\ &C\to S \end{align*}$$

This introduces a new $\epsilon$ production, $A\to\epsilon$, which must now be eliminated. After removing $A\to\epsilon$ and adding all productions obtained by substituting $\epsilon$ for $A$ on the righthand side of an existing production, we have the following set of productions:

$$\begin{align*} &T\to S\\ &S\to 0A0\mid 00\mid 1B1\mid BB\\ &A\to C\\ &B\to S\mid A\mid\epsilon\\ &C\to S \end{align*}$$

Once again we’ve introduced an $\epsilon$ production; repeating the procedure that we’ve used twice leaves us with

$$\begin{align*} &T\to S\\ &S\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid B\mid\epsilon\\ &A\to C\\ &B\to S\mid A\\ &C\to S\;, \end{align*}$$

from which we have to eliminate $S\to\epsilon$. Note that during the elimination of $\epsilon$ productions we do not add an $\epsilon$ production that was previously removed, so we don’t add $B\to\epsilon$ or $C\to\epsilon$, and this step leaves us with the following set of productions:

$$\begin{align*} &T\to S\mid\epsilon\\ &S\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid B\\ &A\to C\\ &B\to S\mid A\\ &C\to S \end{align*}$$

Since $T$ is the initial symbol, the production $T\to\epsilon$ is permissible, so we can move on to eliminating the unit productions. The first of these is $T\to S$, and its elimination leaves us with the following set of productions:

$$\begin{align*} &T\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid B\mid\epsilon\\ &A\to C\\ &B\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid A\\ &C\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid B \end{align*}$$

(Note that I eliminated the pointless production $B\to B$.) It’s convenient now to eliminate $A\to C$:

$$\begin{align*} &T\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid B\mid\epsilon\\ &A\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid B\\ &B\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid A\\ &C\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid B \end{align*}$$

$C$ is now an orphan: it cannot be generated by any production, so we can drop all of the $C$ productions:

$$\begin{align*} &T\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid B\mid\epsilon\\ &A\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid B\\ &B\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid A \end{align*}$$

The only thing that $B$ produces that is not already produced by $A$ is $A$ itself, so we can simply remove the unit production $A\to B$, at which point $B\to A$ and then $T\to B$ are also removable without further adjustment:

$$\begin{align*} &T\to 0A0\mid 00\mid 1B1\mid 11\mid BB\mid\epsilon\\ &A\to 0A0\mid 00\mid 1B1\mid 11\mid BB\\ &B\to 0A0\mid 00\mid 1B1\mid 11\mid BB \end{align*}$$

At this point we might notice that $A$ and $B$ are synonymous, so we can replace $B$ by $A$ everywhere and drop the $B$ productions:

$$\begin{align*} &T\to 0A0\mid 00\mid 1A1\mid 11\mid AA\mid\epsilon\\ &A\to 0A0\mid 00\mid 1A1\mid 11\mid AA \end{align*}$$

Next I’ll replace $0$ by $V_0$ and $1$ by $V_1$, adding productions $V_0\to 0$ and $V_1\to 1$:

$$\begin{align*} &T\to V_0AB_0\mid V_0V_0\mid V_1AV_1\mid V_1V_1\mid AA\mid\epsilon\\ &A\to V_0AV_0\mid V_0V_0\mid V_1AV_1\mid V_1V_1\mid AA\\ &V_0\to 0\\ &V_1\to 1 \end{align*}$$

The final step is to get rid of the four productions whose righthand sides are too long:

$$\begin{align*} &T\to V_0X_0\mid V_0V_0\mid V_1X_1\mid V_1V_1\mid AA\mid\epsilon\\ &A\to V_0X_0\mid V_0V_0\mid V_1X_1\mid V_1V_1\mid AA\\ &X_0\to AV_0\\ &X_1\to AV_1\\ &V_0\to 0\\ &V_1\to 1 \end{align*}$$

(I think it fair to say that this was set up to be a fairly messy problem.)