[Math] How to determine the number of possible combinations of letters that contain a degenerate substring

combinatorics, probability

I've been racking my brain for a couple of days trying to work out a series or closed-form expression for the following problem:

Specifically: given all strings of length $N$ drawn from an alphabet of $L$ letters (starting with 'A', for example {A, B}, {A, B, C}, …), how many of those strings contain a substring that matches the pattern: 'A', more than one not-'A', 'A'? The standard regular expression for that pattern is A[^A][^A]+A.

The number of possible strings is simple enough: $L^N$. For small values of $N$ and $L$, it's also quite practical to simply generate all possible combinations and use a regular expression to find the strings that contain the pattern; in R:

# Generate every string of length N over the first L letters of the alphabet.
all.combinations <- function(N, L) {
    apply(
        expand.grid(rep(list(LETTERS[1:L]), N)),
        1,
        paste,
        collapse = ''
    )
}

# Count how many of those strings contain the pattern.
matching.pattern <- function(N, L, pattern = 'A[^A][^A]+A') {
    sum(grepl(pattern, all.combinations(N, L)))
}

all.combinations(4, 2)
matching.pattern(4, 2)

I had come up with the following, which works for N < 7:

$$M(N, L) = \sum_{g=2}^{N-2} (N - g - 1) \cdot (L - 1)^g \cdot L^{N - g - 2}$$

Unfortunately, that only works for N < 7 because it simply adds up the counts of combinations containing the substrings A..A, A...A, A....A, etc., and some combinations obviously contain more than one matching substring (e.g., A..A..A, A..A...A), so they get counted more than once.
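Here is a quick R sketch of that sum (the name M.naive is just for illustration) that makes the failure point visible: it agrees with the enumeration up to N = 6, but for N = 7 and L = 2 it returns 49 instead of 48; for example, ABBABBA is counted once for each of its two A..A blocks.

# Direct implementation of the sum above (overcounts for N >= 7).
M.naive <- function(N, L) {
    g <- 2:(N - 2)                                # gap lengths between the two A's
    sum((N - g - 1) * (L - 1)^g * L^(N - g - 2))  # positions * fillers * free letters
}

M.naive(6, 2)  # 17 -- agrees with the enumeration
M.naive(7, 2)  # 49 -- one too many: ABBABBA is counted twice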

For example: when N = 4 and L = 2 ({A, B}), there are $L^N = 16$ possible strings: AAAA, BAAA, ABAA, BBAA, AABA, BABA, ABBA, BBBA, AAAB, BAAB, ABAB, BBAB, AABB, BABB, ABBB, BBBB. Only one of those strings, ABBA, matches the pattern 'A', more than one not-'A', 'A' (in regular-expression syntax, A[^A][^A]+A).

When N = 4, and L = 3 ({A, B, C}), there are $L^N = 81$ possible strings, and there are 4 strings that match the pattern: ABBA, ABCA, ACBA, ACCA.

When N = 6, and L = 2, there are $L^N = 64$ combinations, and 17 strings that match the pattern: ABBAAA, AABBAA, BABBAA, ABBBAA, ABBABA, AAABBA, BAABBA, ABABBA, BBABBA, AABBBA, BABBBA, ABBBBA, ABBAAB, AABBAB, BABBAB, ABBBAB, ABBABB

Any suggestions?

For what it is worth, here are the number of combinations and the number of matching combinations for some values of N and L that are tractable to compute by generating all combinations and doing a regular-expression match:

 N  L  combinations  matching
--  -  ------------  --------
 4  2            16         1
 5  2            32         5
 6  2            64        17
 7  2           128        48
 8  2           256       122
 9  2           512       290
10  2          1024       659
 4  3            81         4
 5  3           243        32
 6  3           729       172
 7  3          2187       760
 8  3          6561      2996
 9  3         19683     10960
10  3         59049     38076
 4  4           256         9
 5  4          1024        99
 6  4          4096       729
 7  4         16384      4410
 8  4         65536     23778
 9  4        262144    118854
10  4       1048576    563499

Best Answer

There are many ways to solve this problem. The following one is of particular statistical interest. At the end it is verified with a brute-force enumeration which itself works remarkably well.


Every regular expression can be implemented as a Finite State Automaton (FSA). The regular expression in the question, A[^A][^A]+A, can be written a little more explicitly in the form

^.*A[^A][^A]+A.*$

corresponding to this FSA:

FSA diagram

The letters of the word are fed to the FSA in order. Transitions are made along arrows labeled by the letters: "A" for an "A" and "A'" for any non-"A". The states can be described as

  1. State $1$ is the initial state.

  2. State $2$ occurs as soon as the first "A" is encountered.

  3. State $3$ records observing a subsequent non-"A".

  4. State $4$ records observing two successive non-"A" characters.

  5. State $5$ is arrived at upon observing another "A." It is the terminal state.

We say the FSA "accepts" a word when its final state is $5.$
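To make those transitions concrete, here is a small R sketch (the helper name fsa.accepts is just an illustrative choice, not part of the analysis) that walks a word through the five states exactly as described and reports whether it ends in state 5; its verdicts agree with the regular expression:

# Step a single word through the five states described above.
fsa.accepts <- function(word) {
  state <- 1
  for (ch in strsplit(word, "")[[1]]) {
    is.A <- (ch == "A")
    state <- switch(state,
      if (is.A) 2 else 1,   # state 1: waiting for the first "A"
      if (is.A) 2 else 3,   # state 2: just saw an "A"
      if (is.A) 2 else 4,   # state 3: "A" followed by one non-"A"
      if (is.A) 5 else 4,   # state 4: "A" followed by two or more non-"A"
      5)                    # state 5: accepting and absorbing
    if (state == 5) break
  }
  state == 5
}

sapply(c("ABBA", "ABAB", "AABBA"), fsa.accepts)     # TRUE FALSE TRUE
grepl("A[^A][^A]+A", c("ABBA", "ABAB", "AABBA"))    # TRUE FALSE TRUE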

The number of words of length $N$ accepted by any FSA is the total of all possible words of length $N$ (namely, $L^N$) times the probability that a random word selected from all possible ones is accepted. Such a random word is constructed from a sequence of independent uniformly distributed letters. Feeding such a sequence to the FSA produces a Markov Chain.

The transition matrix $\mathbb P = (p_{ij})$ for a Markov chain gives the chance of making a transition from state $i$ to state $j.$ We can read these chances directly off the graph of the FSA, because, no matter what the state might be, the uniform selection of letters implies

  • The chance of observing an "A" is $1/L = p$ and

  • The chance of observing a non-"A" is $1-1/L = 1-p = q.$

Consequently, using the indexing of states as shown in the figure, the transition matrix is

$$\mathbb{P} = \pmatrix{ q & p & 0 & 0 & 0 \\ 0 & p & q & 0 & 0 \\ 0 & p & 0 & q & 0 \\ 0 & 0 & 0 & q & p \\ 0 & 0 & 0 & 0 & 1}$$

Standard techniques of linear algebra enable us to compute the effect of making $n$ transitions in the FSA, which (because all are independent) is given by the matrix $\mathbb{P}^n.$ Its $(1,5)$ entry is--by definition--the chance that the Markov Chain is in the terminal state ($5$) after $n$ transitions. Linear algebra tells us the formula will be a linear combination of $N^\text{th}$ powers of the eigenvalues of $\mathbb{P}$ whose coefficients are polynomials in $N.$
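As a quick numerical check of that claim (done here in base R, independently of the functions defined later): for $L = 2$ and $N = 4,$ the $(1,5)$ entry of $\mathbb{P}^4,$ scaled by $L^N = 16,$ should recover the single matching string ABBA from the question.

p <- 1/2; q <- 1 - p
P <- rbind(c(q, p, 0, 0, 0),
           c(0, p, q, 0, 0),
           c(0, p, 0, q, 0),
           c(0, 0, 0, q, p),
           c(0, 0, 0, 0, 1))
P4 <- P %*% P %*% P %*% P
2^4 * P4[1, 5]   # 1, the N = 4, L = 2 entry of the table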

FWIW, for this particular FSA the formula is given by the following R function f; the variables e0 and e1 (as well as 1, l, and l-1) are directly proportional to those eigenvalues:

f <- Vectorize(function(n, l) {
  if (l==3) {
    # l = 3 must be handled separately: the general expression below divides by (l - 3)^2.
    (2 * (-1)^n - (56 + 15*n + 9*n^2) * 2^n + 54 * 3^n) / 54
  } else {
    delta <- sqrt(4*l - 3)
    e0 <- (1 - delta)/2    # proportional to the two non-obvious eigenvalues of P
    e1 <- (1 + delta)/2
    (((e1^n - e0^n) * (3 + 2*l))/(2 * (e0 - e1)) - 3/2 * (e1^n + e0^n) + 
        (l - 3)^2 * l^n - (l - 1)^n * (6 - 3*n + l * (l + n - 6))) / (l - 3)^2
  }
})
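As a spot check, evaluating f at the values tabulated in the question should reproduce the first two columns of the table, up to floating-point rounding:

round(f(4:10, 2))   # 1 5 17 48 122 290 659
round(f(4:10, 3))   # 4 32 172 760 2996 10960 38076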

(I found this through an elementary combinatorial analysis of the regular expression.) It produces the same output as the functions f and enumerate presented below. However, this is a relatively uninteresting solution because it is specific to the particular regular expression of the question. The remainder of this answer discusses approaches that generalize.

It is practicable to compute the powers of $\mathbb{P}$ by brute force. For instance, the R program below tabulates all the answers for $N \le 25$ and $L \le 10$ within 0.01 seconds. Obtaining all answers for $N\le 500$ for any given $L$ takes only a second--and almost all of that effort consists of a one-time calculation to determine the fastest way to compute the powers. It's not worth using double-precision floats for the computation when $N \gt 1000,$ because even when $L=2$ the values overflow.
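The overflow threshold is easy to verify: the largest finite double is just under $2^{1024},$ so for $L = 2$ even the total count $L^N$ stops being representable around $N = 1024.$

2^1023   # 8.988466e+307, still finite
2^1024   # Inf -- even the count of all binary strings overflows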

Here's a piece of the output. The smaller parts agree with the tabulation in the question (and with my own independent tabulation).

      l=2    l=3      l=4       l=5
n=4     1      4        9        16
n=5     5     32       99       224
n=6    17    172      729      2096
n=7    48    760     4410     16128
n=8   122   2996    23778    110672
n=9   290  10960   118854    704288
n=10  659  38076   563499   4251248
n=11 1451 127464  2570769  24686912
n=12 3123 415236 11395539 139221648

Code

In principle, one could write a program to produce the transition matrix $\mathbb P$ given any regular expression and alphabet length. That's too much to tackle here, so I just hard-coded the transition matrix in the function fsa. Preliminary functions power.optimize and matrix.power provide a mechanism to compute matrix powers quickly: they are optimized for tabulating a large number of results involving relatively small powers of many different matrices. (The former can take considerable time for $N\gg 500,$ but that's a problem requiring extended-precision or exact arithmetic anyway.) The actual work, performed by the function f, amounts to computing $\mathbb{P}$ (which depends on $L$), taking its $N^\text{th}$ power, and returning its $(1,5)$ entry.

#
# Find the fastest way to compute powers up to `n`.
# Returns a data structure useful for performing the computation.
#
power.optimize <- function(n) {
  components <- as.list(1:n)
  combine <- function(i, j) union(components[[i]], components[[j]])

  if (n > 1) {
    for (k in 2:n) {
      cost <- sapply(1:(k-1), function(i) 1 + length(combine(i, k-i)))
      i <- which.min(cost)
      components[[k]] <- c(k, combine(k-i, i))
    }
  }
  components
}
#
# Compute a positive integral power of a matrix.
#
matrix.power <- function(x, n, components) {
  if (missing(components)) components <- power.optimize(n)

  powers <- list()
  powers[[1]] <- x
  for (i in rev(setdiff(components[[n]], 1))) {
    k <- components[[i]][2]
    j <- components[[i]][1] - k
    powers[[i]] <- powers[[j]] %*% powers[[k]]
  }

  powers[[n]]
}
#
# Create the transition matrix for the regular expression "^.*A[^A][^A]+A.*$"
# where "A" is from an alphabet of `L` letters.
#
fsa <- function(L) {
  p <- 1/L   # chance of an "A"
  q <- 1-p   # chance of a non-"A"
  # `matrix` fills column by column, so each group of five values below is a
  # *column* of the transition matrix P displayed earlier.
  matrix(c(q,0,0,0,0,
           p,p,p,0,0,
           0,q,0,0,0,
           0,0,q,q,0,
           0,0,0,p,1), 5)
}
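#
# Optional sanity check: each row of the transition matrix is a probability
# distribution, and for L = 2 it matches the matrix displayed earlier.
#
stopifnot(all(abs(rowSums(fsa(2)) - 1) < 1e-12))
stopifnot(fsa(2)[1, 2] == 1/2, fsa(2)[4, 5] == 1/2)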
#
# Tabulate the results.
#
f <- Vectorize(function(N, L, components) {
  x <- fsa(L)
  x.N <- matrix.power(x, N, components)
  L^N * x.N[1,5]   # total number of words times the acceptance probability
}, c("N", "L"))

N <- 4:12; names(N) <- paste0("n=", N)
L <- 2:5; names(L) <- paste0("l=", L)
components <- power.optimize(max(N))
results <- outer(N, L, f, components=components)
results

More code

The following checks the answer via a complete enumeration of the non-matching strings. Its output, which is almost instantaneous, is identical to the output above.

It employs a simplification: since the regular expression distinguishes only between "A" and non-"A", the calculation can be carried out over a binary alphabet (representing these two classes of letters) where the letters are now chosen with probabilities $1/L$ and $1-1/L,$ respectively. Thus its time and space demands are only those of a binary alphabet rather than one with all $L$ letters. It can compute the full table (out to $N=25$ and $L=10$) in a minute or two.

This solution is very flexible: it can be applied to any regular expression (perhaps with minor modifications for more complex expressions).

#
# Enumeration.
#
enumerate <- Vectorize(function(N, L, prob, r="^.*A[^A][^A]+A.*$", alphabet=c("A","B")) {
  if (missing(prob)) {
    prob <- c(1,L-1)/L
    alphabet <- alphabet[1:2]
  }
  non.match <- "" # The non-matching prefixes
  p <- 1          # Their probabilities
  for (i in 1:N) {
    words <- c(outer(non.match, alphabet, paste0))   # Lengthen all non-matches
    p <- c(outer(p, prob))                           # Compute their probabilities
    keep <- !grepl(r, words)                         # Search for the pattern
    non.match <- words[keep]                         # Eliminate matches
    p <- p[keep]                                     # Eliminate their probabilities
  }
  L^N  * (1 - sum(p))                                # Count the matches
}, c("N", "L"))


N <- 4:12; names(N) <- paste0("n=", N)
L <- 2:5; names(L) <- paste0("l=", L)
outer(N, L, enumerate)
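Assuming the results matrix from the earlier block is still in the workspace, the two methods can be compared directly; the following should print TRUE (to within floating-point tolerance):

all.equal(results, outer(N, L, enumerate))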

References

Aho, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, First Edition.