[Math] Counting number of strings that contain substring matching Regular Expression

combinatoricsgenerating-functionsregular expressions

I have regular expression R, and I want to find F(n) is the number of strings of length n that strings that contain substring matching Regular Expression. Suppose the alphabet-size is M.

We can apply the generating function to compute the number of strings of length n that matches R, but i can't find any answer for the above question.

Do you have any idea on that?

Best Answer

For the particular example $M=2$ and $R=(10 | 1)^\ast 1^\ast$, I suppose if we first convert it to a CFG then there appears to be a general formula that involves convolutions. For example:

  • $A \rightarrow \epsilon | 0A | 1A$
  • $C \rightarrow \epsilon | 10C | 0C$
  • $D \rightarrow \epsilon | 1D$
  • $E \rightarrow ACDA$.

Then to get a string of length $n$, concatenate a string from $A$ of length $k_1$, a string from $C$ of length $k_2$, a string from $D$ of length $k_3$ and a string of length $A$ of length $k_4$, then you get

$N_E(n) = \sum_{k_1+k_2+k_3+k_4 = n} N_A(k_1)N_C(k_2)N_D(k_3) N_A(k_4)$

where $0 \leq k_i \leq n$. Ok, so there was no need to use a CFG, we could arrive here directly by identifying the parts!

Note $N_A(k_1) = 2^{k_1}$, $N_D(k_3) = 1$, and $N_E(k_4) = 2^{k_4}$. $N_C(k_2) = N_C(k_2-1) + N_C(k_2-2)$, Fibonacci recurrence with $N_C(1) = 1$ and $N_C(2) = 2$.

$N_{E,1}(n) = \sum_{k_1+k_2+k_3+k_4=n} 2^{k_1+k_4} F(k_3) = \sum_{a+b \leq n} 2^{a} F(b) = \sum_{b\leq n} (2^{n-b+1}-1) F(b)$. Looks like this can be attacked with generating functions.

This approach seems to generalize to other regular expressions. Oh and you said you know how to do it for exact matching. So for substring, just add an $A$ before and an $A$ after :) ?

Oh drats... but then you might be double-counting. You'd have to use inclusion-exclusion to subtract the number of strings where the pattern appears at least twice, add back where it appears at least three times, etc. But maybe it is okay.

For $ACDACDA$ (string appears at least twice): $N_{E,2} = \sum_{k_1+k_2+k_3+k_4+k_5+k_6+k_7=n} = N_A(k_1)N_C(k_2)N_A(k_4)N_C(k_5)N_A(k_7) = \sum_{a+b+c<n}(2^a-1) F(b)F(c)$.

Look for a general pattern?

$N_E(n) = N_{E,1}(n) - N_{E,2}(n) + N_{E,3}(n) + \ldots + N_{E,n}(n)$.


I didn't realize you had a generating function for strings matching $R$. Let us assume this now. Given the generating function $G_R(x)$ that contains number of strings of length $n$ matching $R$, let us figure out some way to obtain the generating function encoding the number of strings of length $n$ containing a substring that matches $R$.

So first, we do the same trick above and pad the sides with arbitrary strings. So the number of strings of length $n$ containing at least 1 substring matching $R$ is the sum $\sum_{a+b \leq n} M^a N_R(b) = \sum_{b \leq n} \frac{M^{n-b+1}-1}{M-1} N_R(b)$.

In general, the number of strings of length $n$ containing at least k substrings matching $R$ is the sum $\sum_{b_1+\ldots+b_k} \frac{M^{n-(b_1+\ldots+b_k)+1}-1}{M-1} N_R(b_1)\cdots N_R(b_k)$.

As a generating function, let $H_k(x)$ correspond to containing at least $k$ substrings.

$H_k(x) = \sum_{j} N_{R,k}(j) x^j = \sum_{j} \sum_{a+b_1+\ldots+b_k = j} C(j-a) x^{j-a} N_R(b_1) x^{b_1} \cdots N_R(b_k) x^{b_k}$

This is a $k$-fold convolution between the generating function of all $M$-strings, and $k-1$ copies of $G_R$.

Now we encode inclusion exclusion:

$H(x) = \sum_{j} \sum_{i=1}^j (-1)^{i-1} N_{R,\ i}(j) x^j = \sum_i (-1)^{i-1} \sum_{j \geq i} N_{R,\ i}(j) x^j $, and this is just the alternating sum of the corresponding partial generating functions. So... I'm going to stop here for the moment.