[Math] A CFG for the language $\{ O^i 1^j 2^k 3^l : i=k, j>1 , l>0 \}$: Is the approach correct

context-free-grammarformal-languages

I am trying to construct a CFG for the CFL $\{ O^i 1^j 2^k 3^l : i=k, j>1 , l>0 \}$.
When given $i=2$, $k=2$, $j=2$, and $l=1$, this yields the string $0011223$.

My approach:
$$\begin{align}
S &\Rightarrow A \\
A &\Rightarrow OAB \mid \varepsilon \\
B &\Rightarrow 1B1C \mid \varepsilon \\
C &\Rightarrow 2CD \mid \varepsilon \\
D &\Rightarrow 3D \mid \varepsilon
\end{align}$$

Please explain to me the thought-process associated when solving such problems, I keep on getting similar answers for all questions, and don't know how to properly tackle them. 🙁

Best Answer

Since the number of $\newcommand{\Ze}{\mathtt{0}}\newcommand{\On}{\mathtt{1}}\newcommand{\Tw}{\mathtt{2}}\newcommand{\Th}{\mathtt{3}}\Ze$s must equal the number of $\Tw$s, a good idea is to start basing your CFG around the usual CFG for the language $\{ \Ze^n \Tw^n : n \geq 0 \}$, namely $$S \rightarrow \Ze S \Tw \mid \varepsilon.$$ We see for the language in question that there are at least two $\On$s in between the $\Ze$s and $\Tw$s, and so we can alter the above rule as follows: $$\begin{align*} S &\rightarrow \Ze S \Tw \mid T \\ T &\rightarrow \On T \mid \On \On \end{align*}$$ Next we have to worry about the $\Th$s. Since we only need a positive number of them at the end of the string with no relationship necessary to any of the other symbols, we can introduce a new start symbol that will just split off into the part of the grammar we have introduced so far, and a part that will generate the $\Th$s: $$\begin{align} S_0 &\rightarrow S U \\ S &\rightarrow \Ze S \Tw \mid T \\ T &\rightarrow \On T \mid \On \On \\ U &\rightarrow \Th U \mid \Th \end{align}$$ And we're done.

Among other strings, your grammar will generate the following:

$\varepsilon$ (generated by $S \Rightarrow A \Rightarrow \varepsilon$);
$\Ze \Ze \Ze$ (generated by $S \Rightarrow A \Rightarrow \Ze A B \Rightarrow \Ze \Ze A B B \Rightarrow \Ze \Ze \Ze A B B B \Rightarrow \Ze \Ze \Ze \varepsilon \varepsilon \varepsilon \varepsilon = \Ze \Ze \Ze$);

which are not in the language. Upon inspection, I think your grammar generates the language $$\{ \Ze^m ( \On^n ( \On \Tw^* \Th^* )^n )^m : n , m \geq 0 \}$$

Related Solutions

Context-Free Grammar – CFG for Language

Simple solution

$$S\to E|E1S\\ E \to 0E1E|1E0E|\varepsilon$$

It's clear that anything generated by $E$ will verify $|u|_0 = |u|_1$ and therefore that anything generated by $S$ will verify $|u|_0 \le |u|_1$.

Claim A: $u\in \mathcal L(E)\iff |u|_0 = |u|_1$

Proof A: By induction on $|u|$.

Since $u\in \mathcal L(E)\iff \overline{u}\in \mathcal L(E)$ (where $\overline{u}$ is $u$ with $0$ and $1$ swapped), we can assume WLOG that $u=1v$. Take $w_1$ the greatest prefix of $v$ so that $|w_1|_0=|w_1|_1$. It exists because $\varepsilon$ is one such prefix. Write $v=w_1w_2$ where $w_2$ is just whatever remains. Since we took $w_1$ the greatest with the property, any prefix of $w_2$ has more $0$s than $1s$. In particular, $w_2$ starts with a $0$. So $w_2=0h$. $w_1$ and $h$ both have as many $0$s as $1$s so by the induction hypothesis, $E\to^*w_1$ and $E\to^*h$. So $E\to0E1E\to^*0w_11h=u$ and we're done.

Claim B: $u\in\mathcal L(S)\iff |u|_0\le |u|_1$

Proof B: Left to the reader. One direction come directly from claim A and the other is by induction on $|u|_1-|u|_0$ (and by taking the greatest prefix for which $|u|_0= |u|_1$ to find the part that comes from $E$).

$\Sigma:= \{0,1\}$

If you want to have more $1$s that $0$s, you just have to make sure that whenever you add a $1$ you add a $0$.

This is done with $I$ that means "add one more $1$ than $0$s" (except that it only generates words for which this is false for all their prefixes). Similarly $O$ means "add one more $0$ than $1$s". $U$ therefore means "bounce upwards and come back to where you were" and $D$ "bounce downwards and come back to where you were". $E$ means "bounce upwards or downwards and come back to where you were as many times as you want". $T$ means "(Go one step up or bounce upwards and come back to where you were) as many times as you like". With this interpretation, it feels plausible that $S$ generates what we want because either you take $E$ and it means that you'll have equality, or you take $E1T$ and you'll get strict inequality because once you're done with $E$, you go up once and then $T$ doesn't let you go down so you'll be strictly above the $|v|_1-|v|_0=0$ line.

Here is an example of how $|v|_1-|v|_0$ can evolve with the prefix $v\le u$.

111000000110100111111000111001110                                                              

                               /\                                       
  /\                /\    /\  /                                              
 /  \              /  \  /  \/                                               
/____\____________/____\/_______________________________       
      \          /                                                        
       \  /\/\  /                                                         
        \/    \/  
1I----0I----------1O----11O--111O
U-----D-----------U-----1U---11U-
E-----------------------1T-------

I'll remove $U$ and $D$ from the grammar and will prove it for this one but it should be clear that those two grammars are equivalent.

Claim 1: Define $H(u)$:$|u|_0 > |u|_1$ and for any strict prefix $v<u$, $|u|_0 \le |u|_1$. Then for any $u\in\Sigma^*$,$u\in \mathcal L(O) \iff H(u)$. Note that $H(u)$ implies $H'(u)$:"$|u|_0=|u|_1+1$" and that the last letter of $u$ is $0$.

Proof 1:

"$\Rightarrow$": Let $P(n)$: "If $O\to^m u\in \Sigma^*$ for some $m\le n$, then H(u)". We prove $P(n)$ by induction on $n\ge 1$.
- If $O\to^1 u\in \Sigma^*$, then the production must be $O\to 0$ so that $u=0$ which verifies the properties.
- If $0\to^{n+1} u \in \Sigma^*$ for $n>0$, then there is $v\in (\Sigma\sqcup \Gamma)^*$ so that $I\to v \to^n u$. The first production can only be $O\to 1OO$ so that $v=1OO$. By the fundamental lemma, $u$ can be written as $u=u_1u_2u_3$ so that $1\to^{n_1} u_1$, $O\to^{n_2} u_2$, $O\to^{n_3} u_3$, $n=n_1+n_2+n_3$. Then $H(u)$ follows from $H(u_2)$ and $H(u_3)$ (which are given by the induction hypothesis because $\forall i, n_i \le n$) and $u=0u_2u_3$.
"$\Leftarrow$": Let $P(n)$: "For any $u\in \Sigma^*$ so that $H(u)$, $u\in \mathcal L(O)$". We prove $P(n)$ by induction on $n$.
- $P(0)$ is vacuously true.
- Suppose $P(n)$. Let $u\in \Sigma^*$ so that $|u|=n+1$ and $H(u)$. Then $u=av$ for some $a\in \Sigma$ and $v\in\Sigma^*$.
- If $a=0$, then $a<u$ would lead to $1=|0|_0\le |0|_1=0$. So we have $0=a=u$ the derivation $O\to u$ works.
- If $a=1$, then we take $w_1$ to be the shortest prefix of $v$ so that $H(w_1)$ and we take $w_2$ so that $u=1w_1w_2$ holds. Since $H(u)$ and $H(w_1)$, we also have $H'(u)$ and $H'(w_1)$. This gives us $H'(w_2)$ and so $|w_2|_0>|w_2|_1$. Take $p<w_2$. Then $1w_1p<u$ so that $|w_1|_1+1+|p|_0=|w_1|_0+|p|_0=|1w_1p|_0\le |1w_1p|_1=1+|w_1|_1+|p|_1$ and we indeed have $|p|_0\le |p|_1$.

Claim 2: Define $\overline{H}(u)$:$|u|_1 > |u|_0$ and for any strict prefix $v<u$, $|u|_1 \le |u|_0$. Then for any $u\in\Sigma^*$,$u\in \mathcal L(I) \iff \overline{H}(u)$. Note that $\overline{H}(u)$ implies $\overline{H}'(u)$:"$|u|_1=|u|_0+1$" and that the last letter of $u$ is $1$.

Proof 2: Similar to proof 1.

Claim 3: $u\in \mathcal L(T)\iff$ for any prefix $v\le u$, $|v|_1 \ge |v|_0$.

Proof 3:

"$\Rightarrow$": Follows from claim 1 by induction on the size of the derivation.
"$\Leftarrow$": Take $u\in\Sigma^*$ so that for any prefix $v\le u$, $|v|_1 \ge |v|_0$. First note that the first letter of $u$ must be $1$ (if $u$ is not $\varepsilon$). We will prove it by induction on $|u|$.
- If $u=\varepsilon$, we have $T\to u$.
- If it exists, take $v$ to be the smallest prefix of $u$ so that $|v|_0=|v|_1$. Then $v=1v'$. Since we took the smallest prefix, we have $H(v')$ and we therefore have $O\to^*v'$ by claim 1. Write $u=vw$ for some $w$. Then by the induction hypothesis, $T\to^* w$ and then $T\to^*1OT\to^*1v'w=u$.
- If it doesn't exist, write $u=1v$. The fact that the prefix doesn't exist implies that $v$ also verifies the property. By the induction hypothesis, we therefore have $T\to^*v$. And we conclude that $T\to 1T\to^*1v=u$.

Claim 4: $u\in\mathcal L(E) \iff |u|_0=|u|_1$

Proof 4: Left to the reader. One implication follows directly from claim 1 and 2 and the other implication is a case analysis on the first letter.

Claim 5: $u\in \mathcal L(S)\iff |u|_0\le |u|_1$

Proof 5: Left to the reader. One implication follows directly from claims 3 and 4 and the other one has two cases: $|u|_0=|u|_1$ or not, and if not $u$ is factored into $u_11u_2$ where $u_1$ is the greatest prefix so that $|u_1|_0=|u_1|_1$.

Bonus claim: The grammar is non-ambiguous.

Bonus proof: Left to the reader. The idea is that the languages generated by $E$ and $E1T$ will be disjoint so you only have to check that each part is non-ambiguous separately. The part of the grammar used by $E$ is $LL(1)$ so it's non-ambiguous. Then for a word $u$, suppose you can write it as $e_11t_1=u=e_21t_2$ with $E\to^*e_i$ and $T\to^*t_i$. $e_i$ is the longest prefix $v<u=e_i1t_i$ so that $|v|_0=|v|_1$ because $|e_i|_0=|e_i|_1$ and for each prefix $w<t_i$, $|w|_0\le|w|1$. This implies that $e_1=e_2$ and therefore that $t_1=t_2$. By a previous argument, there is only one derivation tree for $e_1$ so all that remains to be checked is that there is only one for $t_1$, or in other words that the part of the grammar used by $T$ is non-ambiguous. It is because you can only take $T\to1OT$ if you're going to come back to the level you are at and $T\to1T$ if you're not.

Best Answer

Related Solutions

Context-Free Grammar – CFG for Language

Related Question