For the particular example $M=2$ and $R=(10 | 1)^\ast 1^\ast$, I suppose if we first convert it to a CFG then there appears to be a general formula that involves convolutions. For example:
- $A \rightarrow \epsilon | 0A | 1A$
- $C \rightarrow \epsilon | 10C | 0C$
- $D \rightarrow \epsilon | 1D$
- $E \rightarrow ACDA$.
Then to get a string of length $n$, concatenate a string from $A$ of length $k_1$, a string from $C$ of length $k_2$, a string from $D$ of length $k_3$ and a string of length $A$ of length $k_4$, then you get
$N_E(n) = \sum_{k_1+k_2+k_3+k_4 = n} N_A(k_1)N_C(k_2)N_D(k_3) N_A(k_4)$
where $0 \leq k_i \leq n$. Ok, so there was no need to use a CFG, we could arrive here directly by identifying the parts!
Note $N_A(k_1) = 2^{k_1}$, $N_D(k_3) = 1$, and $N_E(k_4) = 2^{k_4}$.
$N_C(k_2) = N_C(k_2-1) + N_C(k_2-2)$, Fibonacci recurrence with $N_C(1) = 1$ and $N_C(2) = 2$.
$N_{E,1}(n) = \sum_{k_1+k_2+k_3+k_4=n} 2^{k_1+k_4} F(k_3) = \sum_{a+b \leq n} 2^{a} F(b) = \sum_{b\leq n} (2^{n-b+1}-1) F(b)$.
Looks like this can be attacked with generating functions.
This approach seems to generalize to other regular expressions. Oh and you said you know how to do it for exact matching. So for substring, just add an $A$ before and an $A$ after :) ?
Oh drats... but then you might be double-counting. You'd have to use inclusion-exclusion to subtract the number of strings where the pattern appears at least twice, add back where it appears at least three times, etc. But maybe it is okay.
For $ACDACDA$ (string appears at least twice):
$N_{E,2} = \sum_{k_1+k_2+k_3+k_4+k_5+k_6+k_7=n} = N_A(k_1)N_C(k_2)N_A(k_4)N_C(k_5)N_A(k_7) = \sum_{a+b+c<n}(2^a-1) F(b)F(c)$.
Look for a general pattern?
$N_E(n) = N_{E,1}(n) - N_{E,2}(n) + N_{E,3}(n) + \ldots + N_{E,n}(n)$.
I didn't realize you had a generating function for strings matching $R$. Let us assume this now. Given the generating function $G_R(x)$ that contains number of strings of length $n$ matching $R$, let us figure out some way to obtain the generating function encoding the number of strings of length $n$ containing a substring that matches $R$.
So first, we do the same trick above and pad the sides with arbitrary strings. So the number of strings of length $n$ containing at least 1 substring matching $R$ is the sum $\sum_{a+b \leq n} M^a N_R(b) = \sum_{b \leq n} \frac{M^{n-b+1}-1}{M-1} N_R(b)$.
In general, the number of strings of length $n$ containing at least k substrings matching $R$ is the sum $\sum_{b_1+\ldots+b_k} \frac{M^{n-(b_1+\ldots+b_k)+1}-1}{M-1} N_R(b_1)\cdots N_R(b_k)$.
As a generating function, let $H_k(x)$ correspond to containing at least $k$ substrings.
$H_k(x) = \sum_{j} N_{R,k}(j) x^j = \sum_{j} \sum_{a+b_1+\ldots+b_k = j} C(j-a) x^{j-a} N_R(b_1) x^{b_1} \cdots N_R(b_k) x^{b_k}$
This is a $k$-fold convolution between the generating function of all $M$-strings, and $k-1$ copies of $G_R$.
Now we encode inclusion exclusion:
$H(x) = \sum_{j} \sum_{i=1}^j (-1)^{i-1} N_{R,\ i}(j) x^j = \sum_i (-1)^{i-1} \sum_{j \geq i} N_{R,\ i}(j) x^j $, and this is just the alternating sum of the corresponding partial generating functions. So... I'm going to stop here for the moment.
To determine if something is in $L$, you should be able to use your definitions backwards. My algorithm for this works by using a depth first search of the possible strings. Also, when I say undo, I mean that given a string like $ub$, to undo it with $u\in L\implies ub\in L$ would leave us with just $u$.
For the string $bbaaba$, it can only be in $L$ if undoing any of the recursion steps leads to a string that is in $L$. To do this, we first see if we can undo the $u\in L\implies ub\in L$. To do this, we see if we can remove a $b$ from the end. Since we cannot, we move onto the next recursive definition. We try to undo $u\in L\implies uab\in L$. We cannot. We then try to undo $u\in L\implies uba\in L$. We can. This means that $bbaaba\in L \text{ if } bbaa\in L$. We then try to undo $u\in L\implies ub\in L$. That fails. Next $u\in L\implies uab\in L$. That also fails. We then try to undo $u\in L\implies uab\in L$. We cannot. Finally, we try to undo $u\in L\implies bua\in L$. This works and leaves us with $ba$. We go through all the recursive definitions on that string and fail every time. We then have to go back to $bbaaba$, and use the last recursion definition. We then try to undo $u\in L\implies bua\in L$. We can, and we are left with $baab$. We then try to undo $u\in L\implies ub\in L$. We can, and we are left with $baa$. We keep searching until we either end up with $b$ or an invalid string.
Best Answer
HINT: You could start by finding a regular expression for the strings that contain the substring $bb$, irrespective of their length: the simplest is
$$(a+b)^*bb(a+b)^*\;.\tag{1}$$
Each of the subexpressions $(a+b)^*$ matches the set of all strings over $\{a,b\}$, so $(1)$ matches anything of the form $ubbv$ with $u,v\in\Sigma^*$, i.e., any string that has the substring $bb$ in it somewhere.
But we want only the strings $ubbv$ of odd length. Now $|uv|=|ubbv|-2$, so $ubbv$ has odd length if and only if $uv$ has odd length. And $|uv|=|u|+|v|$, so we want to generate all pairs of strings $u$ and $v$ such that $|u|+|v|$ is odd. That happens precisely when one of $|u|$ and $|v|$ is odd and the other is even.
Explain why the regular expression $$(aa+ab+ba+bb)^*\tag{2}$$ matches precisely the strings over $\Sigma$ that have even length.
Find a regular expression that matches precisely the strings over $\Sigma$ that have odd length; you may find it helpful to start with $(2)$ and modify or add to it.
Use the previous results to get a regular expression that matches precisely the words $ubbv\in\Sigma^*$ such that $|u|$ is odd and $|v|$ is even.
Do the same for words $ubbv$ such that $|u|$ is even and $|v|$ is odd.
Now put the pieces together to get a regular expression that matches precisely those words over $\Sigma$ that contain the substring $bb$ and have odd length.