For the particular example $M=2$ and $R=(10 | 1)^\ast 1^\ast$, I suppose if we first convert it to a CFG then there appears to be a general formula that involves convolutions. For example:
- $A \rightarrow \epsilon | 0A | 1A$
- $C \rightarrow \epsilon | 10C | 0C$
- $D \rightarrow \epsilon | 1D$
- $E \rightarrow ACDA$.
Then to get a string of length $n$, concatenate a string from $A$ of length $k_1$, a string from $C$ of length $k_2$, a string from $D$ of length $k_3$ and a string of length $A$ of length $k_4$, then you get
$N_E(n) = \sum_{k_1+k_2+k_3+k_4 = n} N_A(k_1)N_C(k_2)N_D(k_3) N_A(k_4)$
where $0 \leq k_i \leq n$. Ok, so there was no need to use a CFG, we could arrive here directly by identifying the parts!
Note $N_A(k_1) = 2^{k_1}$, $N_D(k_3) = 1$, and $N_E(k_4) = 2^{k_4}$.
$N_C(k_2) = N_C(k_2-1) + N_C(k_2-2)$, Fibonacci recurrence with $N_C(1) = 1$ and $N_C(2) = 2$.
$N_{E,1}(n) = \sum_{k_1+k_2+k_3+k_4=n} 2^{k_1+k_4} F(k_3) = \sum_{a+b \leq n} 2^{a} F(b) = \sum_{b\leq n} (2^{n-b+1}-1) F(b)$.
Looks like this can be attacked with generating functions.
This approach seems to generalize to other regular expressions. Oh and you said you know how to do it for exact matching. So for substring, just add an $A$ before and an $A$ after :) ?
Oh drats... but then you might be double-counting. You'd have to use inclusion-exclusion to subtract the number of strings where the pattern appears at least twice, add back where it appears at least three times, etc. But maybe it is okay.
For $ACDACDA$ (string appears at least twice):
$N_{E,2} = \sum_{k_1+k_2+k_3+k_4+k_5+k_6+k_7=n} = N_A(k_1)N_C(k_2)N_A(k_4)N_C(k_5)N_A(k_7) = \sum_{a+b+c<n}(2^a-1) F(b)F(c)$.
Look for a general pattern?
$N_E(n) = N_{E,1}(n) - N_{E,2}(n) + N_{E,3}(n) + \ldots + N_{E,n}(n)$.
I didn't realize you had a generating function for strings matching $R$. Let us assume this now. Given the generating function $G_R(x)$ that contains number of strings of length $n$ matching $R$, let us figure out some way to obtain the generating function encoding the number of strings of length $n$ containing a substring that matches $R$.
So first, we do the same trick above and pad the sides with arbitrary strings. So the number of strings of length $n$ containing at least 1 substring matching $R$ is the sum $\sum_{a+b \leq n} M^a N_R(b) = \sum_{b \leq n} \frac{M^{n-b+1}-1}{M-1} N_R(b)$.
In general, the number of strings of length $n$ containing at least k substrings matching $R$ is the sum $\sum_{b_1+\ldots+b_k} \frac{M^{n-(b_1+\ldots+b_k)+1}-1}{M-1} N_R(b_1)\cdots N_R(b_k)$.
As a generating function, let $H_k(x)$ correspond to containing at least $k$ substrings.
$H_k(x) = \sum_{j} N_{R,k}(j) x^j = \sum_{j} \sum_{a+b_1+\ldots+b_k = j} C(j-a) x^{j-a} N_R(b_1) x^{b_1} \cdots N_R(b_k) x^{b_k}$
This is a $k$-fold convolution between the generating function of all $M$-strings, and $k-1$ copies of $G_R$.
Now we encode inclusion exclusion:
$H(x) = \sum_{j} \sum_{i=1}^j (-1)^{i-1} N_{R,\ i}(j) x^j = \sum_i (-1)^{i-1} \sum_{j \geq i} N_{R,\ i}(j) x^j $, and this is just the alternating sum of the corresponding partial generating functions. So... I'm going to stop here for the moment.
Best Answer
I think the strings can be defined out of two type of "words".
$a=(100^*)\qquad 10,100,1000,10000,...$
$b=(111^*0)\qquad 110,1110,11110,111110,...$
The criterion is to split the string where a one occurs just after a zero.
So since $a$ and $b$ are starting with $1$ the $1100$ substring cannot be produced.
$\color{green}{0000}\color{red}110\color{red}10\color{red}110\color{red}1000 \color{red}1000\color{red}10\color{red}10\color{red}11110\color{red}100 \color{red}1111110\color{red}1110\color{red}1000\color{red}100\color{red}111110\color{red}100\color{red}100\color{red}10\color{red}110 \color{red}100\color{red}100000\\\color{red}110\color{red}10\color{red}100000 \color{red}10000\color{red}111110\color{red}110\color{red}10\color{red}10 \color{red}110\color{red}110\color{red}100\color{red}11110\color{red}110 \color{red}100\color{red}1110\color{red}1110\color{red}110\color{red}10\color{blue}{1111111}$
Thus we have $\{a,b\}^*$
To complete the representation we also have to take care of:
Note, we do not have to care about trailing zeroes, because it is already accounted into $a$, and we also do not have to care about a possibly ultimate zero because it is already accounted in $b$.
I hope I have made no mistake, this is a bit complicated to deal with all the cases.
So if I have understood what a generating function for a regexp is, it should be: $g(x)=\left(\dfrac 1{1-x}\right)\left(\dfrac 1{1-\left(\left(\dfrac {x^2}{1-x}\right)+\left(\dfrac{x^3}{1-x}\right)\right)}\right)\left(\dfrac 1{1-x}\right)=\dfrac 1{(1-x)(1-x-x^2-x^3)}\\\phantom{g(x)}=\dfrac 1{1-2x+x^4}$