[Math] generating function for binary strings that don’t contain $00100$ as a substring

combinatoricsdiscrete mathematics

On an alphabet $\{0, 1\}$, what's the generating function for the set of strings that don't contain $00100$ as a substring?
I've tried writing the set of strings that don't contain $00100$ in terms of concatenations of other sets(here is where i get stuck) and then use the product lemma.
Any hint on how i might write it?

Best Answer

I have found an alternative and amazingly simple pattern for your sequences, that allows to calculate the generation function in a much more simple way.

Consider all blocks that end with $1$. There are two types of such blocks: $A$ which is either $1$ or $01$, and $B$ which is $000^*1$, i.e. at least two zeros followed by $1$ ($^*$ means repeat zero or more times).

Now, every sequence of $0$'s and $1's$ can be constructed as $\{A;B\}^*0^*$. That is we repeat blocks ending in $1$ and add any number of $0$'s at the end. The resulting sequence does not contain $X=00100$ iff two conditions are met: we don't have two consecutive blocks $B$, i.e. $BB$ is forbidden, and if we end the repetitive part with $B$ then the zero-tail cannot contain more than one $0$.

Overall,

$$A^*(BAA^*)^*\{B;B0;0^*\}$$

The generating functions (note, that if $Z$ has generating function $z(x)$, then $Z^*$ has generating function $\frac{1}{1-z(x)}$):

$$A=\{1;01\}:a(x)=x+x^2$$

$$B=000^*1:b(x)=\frac{x^3}{1-x}$$

And by the product lemma, for $A^*(BAA^*)^*\{B;B0;0^*\}$ we have

$$f(x)=\frac{1}{1-x-x^2}\frac{1}{1-\frac{x^4(1+x)}{(1-x)}\cdot\frac{1}{1-x-x^2}}\cdot\left(\frac{x^3}{1-x}+\frac{x^4}{1-x}+\frac{1}{1-x}\right)=$$

$$=\frac{1+x^3+x^4}{1-2x+x^3-x^4-x^5}.$$

Note, that $f(x)=1+2x+4x^2+8x^3+16x^4+31x^5+60x^6+116x^7+225x^8+437x^9+849x^{10}+\ldots$, and, indeed, we should have $n_l$ sequences of length $l$ where $n_l=2^l$ for $l\le 4$, $n_l=2^l-(l-4)2^{l-5}$ for $5\le l\le 7$, $n_l=2^l-(l-4)2^{l-5}+\sum_{k=1}^{l-7}k2^{k-1}$ for $8\le l\le 10$, etc.

Related Solutions

How Many Strings Contain Every Letter of the Alphabet

Let $W(c,n)$ denote the number of words of length $c$ from an alphabet of $n$ letters. Then $W(c,n)=n^c$.

Out of these, the number of words of the same size that do not contain one of the letters is $W(c,n-1)=(n-1)^c$. The number of ways of choosing which letter is missing is $\binom{n}{1}$.

The number of words of the same size that do not contain two letters is $W(c,n-2)=(n-2)^c$. The number of ways of choosing which two letters are missing is $\binom{n}{2}$... and so on ...

Now we use inclusion-exclusion principle: (subtract the number of words missing one of the letters, then add the number missing two of the letters, subtract the number missing three of the letters,...)

We get:

$$W(c,n)-\binom{n}{1}W(c,n-1)+\binom{n}{2}W(c,n-2)-\binom{n}{3}W(c,n-3)+\cdots+(-1)^{n-1}\binom{n}{n-1}W(c,n-(n-1)).$$

This is

$$n^c-\binom{n}{1}(n-1)^c+\binom{n}{2}(n-2)^c-\binom{n}{3}(n-3)^c+\cdots+(-1)^{n-1}\binom{n}{n-1}1^c.$$

$$\sum_{k=0}^{n-1}(-1)^k\binom{n}{k}(n-k)^c.$$

Another way could be: Denote $S_c^n$ the number of ways to partition the word of length $c$ into $n$ pieces. Then we just need to choose which letter goes to each of the $n$ pieces. This number is $n!$. So the number of words we are looking for is

$$n!S_c^n.$$

The numbers $S_c^n$ are called Stirling's numbers of the second kind.

[Math] How to determine the number of possible combinations of letters that contain a degenerate substring

There are many ways to solve this problem. The following one is of particular statistical interest. At the end it is verified with a brute-force enumeration which itself works remarkably well.

All regular expressions are implemented as Finite State Automata (FSA). The regular expression in the question, A[^A][^A]+A, can be a little more explicitly written in the form

^.*A[^A][^A]+A.*$

corresponding to this FSA:

The letters of the word are fed to the FSA in order. Transitions are made along arrows labeled by the letters: "A" for an "A" and "A'" for any non-"A". The states can be described as

State $1$ is the initial state.
State $2$ occurs as soon as the first "A" is encountered.
State $3$ records observing a subsequent non-"A".
State $4$ records observing two successive non-"A" characters.
State $5$ is arrived at upon observing another "A." It is the terminal state.

We say the FSA "accepts" a word when its final state is $5.$

The number of words of length $N$ accepted by any FSA is the total of all possible words of length $N$ (namely, $L^N$) times the probability that a random word selected from all possible ones is accepted. Such a random word is constructed from a sequence of independent uniformly distributed letters. Feeding such a sequence to the FSA produces a Markov Chain.

The transition matrix $\mathbb P = (p_{ij})$ for a Markov chain gives the chance of making a transition from state $i$ to state $j.$ We can read these chances directly off the graph of the FSA, because, no matter what the state might be, the uniform selection of letters implies

The chance of observing an "A" is $1/L = p$ and
The chance of observing a non-"A" is $1-1/L = 1-p = q.$

Consequently, using the indexing of states as shown in the figure, the transition matrix is

$$\mathbb{P} = \pmatrix{ q & p & 0 & 0 & 0 \\ 0 & p & q & 0 & 0 \\ 0 & p & 0 & q & 0 \\ 0 & 0 & 0 & q & p \\ 0 & 0 & 0 & 0 & 1}$$

Standard techniques of linear algebra enable us to compute the effect of making $n$ transitions in the FSA, which (because all are independent) is given by the matrix $\mathbb{P}^n.$ Its $(1,5)$ entry is--by definition--the chance that the Markov Chain is in the terminal state ($5$) after $n$ transitions. Linear algebra tells us the formula will be a linear combination of $N^\text{th}$ powers of the eigenvalues of $\mathbb{P}$ whose coefficients are polynomials in $N.$

FWIW, for this particular FSA the formula is given by the following R function f; the variables e0 and e1 (as well as 1, l, and l-1) are directly proportional to those eigenvalues:

f <- Vectorize(function(n, l) {
  if (l==3) {
    (2 * (-1)^n - (56 + 15*n + 9*n^2) * 2^n + 54 * 3^n) / 54
  } else {
    delta <- sqrt(4*l - 3)
    e0 <- (1 - delta)/2
    e1 <- (1 + delta)/2
    (((e1^n - e0^n) * (3 + 2*l))/(2 * (e0 - e1)) - 3/2 * (e1^n + e0^n) + 
        (l - 3)^2 * l^n - (l - 1)^n * (6 - 3*n + l * (l + n - 6))) / (l - 3)^2
  }
})

(I found this through an elementary combinatorial analysis of the regular expression.) It produces the same output as the functions f and enumerate presented below. However, this is a relatively uninteresting solution because it is specific to the particular regular expression of the question. The remainder of this answer discusses approaches that generalize.

It is practicable to compute the powers of $\mathbb{P}$ by brute force. For instance, the R program below tabulates all the answers for $N \le 25$ and $L \le 10$ within 0.01 seconds. Obtaining all answers for $N\le 500$ for any given $L$ takes only a second--and almost all of that effort consists of a one-time calculation to determine the fastest way to compute the powers. It's not worth using double-precision floats for the computation when $N \gt 1000,$ because even when $L=2$ the values overflow.

Here's a piece of the output. The smaller parts agree with the tabulation in the question (and with my own independent tabulation).

      l=2    l=3      l=4       l=5
n=4     1      4        9        16
n=5     5     32       99       224
n=6    17    172      729      2096
n=7    48    760     4410     16128
n=8   122   2996    23778    110672
n=9   290  10960   118854    704288
n=10  659  38076   563499   4251248
n=11 1451 127464  2570769  24686912
n=12 3123 415236 11395539 139221648

Code

In principle, one could write a program to produce the transition matrix $\mathbb P$ given any regular expression and alphabet length. That's too much to tackle here, so I just hard-coded the transition matrix in the function fsa. Preliminary functions power.optimize and matrix.power provide a mechanism to compute matrix powers quickly: they are optimized for tabulating a large number of results involving relatively small powers of many different matrices. (The former can take considerable time for $N\gg 500,$ but that's a problem requiring extended-precision or exact arithmetic anyway.) The actual work, performed by the function f, amounts to computing $\mathbb{P}$ (which depends on $L$), taking its $N^\text{th}$ power, and returning its $(1,5)$ entry.

#
# Find the fastest way to compute powers up to `n`.
# Returns a data structure useful for performing the computation.
#
power.optimize <- function(n) {
  components <- as.list(1:n)
  combine <- function(i, j) union(components[[i]], components[[j]])

  if (n > 1) {
    for (k in 2:n) {
      cost <- sapply(1:(k-1), function(i) 1 + length(combine(i, k-i)))
      i <- which.min(cost)
      components[[k]] <- c(k, combine(k-i, i))
    }
  }
  components
}
#
# Compute a positive integral power of a matrix.
#
matrix.power <- function(x, n, components) {
  if (missing(components)) components <- power.optimize(n)

  powers <- list()
  powers[[1]] <- x
  for (i in rev(setdiff(components[[n]], 1))) {
    k <- components[[i]][2]
    j <- components[[i]][1] - k
    powers[[i]] <- powers[[j]] %*% powers[[k]]
  }

  powers[[n]]
}
#
# Create the transition matrix for the regular expression "^.*A[^A][^A]+A.*$"
# where "A" is from an alphabet of `L` letters.
#
fsa <- function(L) {
  p <- 1/L
  q <- 1-p
  matrix(c(q,0,0,0,0,
           p,p,p,0,0,
           0,q,0,0,0,
           0,0,q,q,0,
           0,0,0,p,1), 5)
}
#
# Tabulate the results.
#
f <- Vectorize(function(N, L, components) {
  x <- fsa(L)
  x.N <- matrix.power(x, N, components)
  L^N * x.N[1,5]
}, c("N", "L"))

N <- 4:12; names(N) <- paste0("n=", N)
L <- 2:5; names(L) <- paste0("l=", L)
components=power.optimize(max(N))
results <- outer(N, L, f, components=components)
results

More code

The following checks the answer via a complete enumeration of the non-matching strings. Its output, which is almost instantaneous, is the same as the previous.

It employs a simplification: since the regular expression distinguishes only between "A" and non-"A", the calculation can be carried out over a binary alphabet (representing these two classes of letters) where the letters are now chosen with probabilities $1/L$ and $1-1/L,$ respectively. Thus its time and space demands are only those of a binary alphabet rather than one with all $L$ letters. It can compute the full table (out to $N=25$ and $L=10$) in a minute or two.

This solution is very flexible: it can be applied to any regular expression (perhaps with minor modifications for more complex expressions).

#
# Enumeration.
#
enumerate <- Vectorize(function(N, L, prob, r="^.*A[^A][^A]+A.*$", alphabet=c("A","B")) {
  if (missing(prob)) {
    prob <- c(1,L-1)/L
    alphabet <- alphabet[1:2]
  }
  non.match <- "" # The non-matching prefixes
  p <- 1          # Their probabilities
  for (i in 1:N) {
    words <- c(outer(non.match, alphabet, paste0))   # Lengthen all non-matches
    p <- c(outer(p, prob))                           # Compute their probabilities
    i <- !grepl(r, words)                            # Search for the pattern
    non.match <- words[i]                            # Eliminate matches
    p <- p[i]                                        # Eliminate their probabilities
  }
  L^N  * (1 - sum(p))                                # Count the matches
}, c("N", "L"))


N <- 4:12; names(N) <- paste0("n=", N)
L <- 2:5; names(L) <- paste0("l=", L)
outer(N, L, enumerate)

References

Aho, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, First Edition.