There are many ways to solve this problem. The following one is of particular statistical interest. At the end it is verified with a brute-force enumeration which itself works remarkably well.
All regular expressions are implemented as Finite State Automata (FSA). The regular expression in the question, A[^A][^A]+A
, can be a little more explicitly written in the form
^.*A[^A][^A]+A.*$
corresponding to this FSA:
The letters of the word are fed to the FSA in order. Transitions are made along arrows labeled by the letters: "A" for an "A" and "A'" for any non-"A". The states can be described as
State $1$ is the initial state.
State $2$ occurs as soon as the first "A" is encountered.
State $3$ records observing a subsequent non-"A".
State $4$ records observing two successive non-"A" characters.
State $5$ is arrived at upon observing another "A." It is the terminal state.
We say the FSA "accepts" a word when its final state is $5.$
The number of words of length $N$ accepted by any FSA is the total of all possible words of length $N$ (namely, $L^N$) times the probability that a random word selected from all possible ones is accepted. Such a random word is constructed from a sequence of independent uniformly distributed letters. Feeding such a sequence to the FSA produces a Markov Chain.
The transition matrix $\mathbb P = (p_{ij})$ for a Markov chain gives the chance of making a transition from state $i$ to state $j.$ We can read these chances directly off the graph of the FSA, because, no matter what the state might be, the uniform selection of letters implies
Consequently, using the indexing of states as shown in the figure, the transition matrix is
$$\mathbb{P} = \pmatrix{ q & p & 0 & 0 & 0 \\
0 & p & q & 0 & 0 \\
0 & p & 0 & q & 0 \\
0 & 0 & 0 & q & p \\
0 & 0 & 0 & 0 & 1}$$
Standard techniques of linear algebra enable us to compute the effect of making $n$ transitions in the FSA, which (because all are independent) is given by the matrix $\mathbb{P}^n.$ Its $(1,5)$ entry is--by definition--the chance that the Markov Chain is in the terminal state ($5$) after $n$ transitions. Linear algebra tells us the formula will be a linear combination of $N^\text{th}$ powers of the eigenvalues of $\mathbb{P}$ whose coefficients are polynomials in $N.$
FWIW, for this particular FSA the formula is given by the following R
function f
; the variables e0
and e1
(as well as 1
, l
, and l-1
) are directly proportional to those eigenvalues:
f <- Vectorize(function(n, l) {
if (l==3) {
(2 * (-1)^n - (56 + 15*n + 9*n^2) * 2^n + 54 * 3^n) / 54
} else {
delta <- sqrt(4*l - 3)
e0 <- (1 - delta)/2
e1 <- (1 + delta)/2
(((e1^n - e0^n) * (3 + 2*l))/(2 * (e0 - e1)) - 3/2 * (e1^n + e0^n) +
(l - 3)^2 * l^n - (l - 1)^n * (6 - 3*n + l * (l + n - 6))) / (l - 3)^2
}
})
(I found this through an elementary combinatorial analysis of the regular expression.) It produces the same output as the functions f
and enumerate
presented below. However, this is a relatively uninteresting solution because it is specific to the particular regular expression of the question. The remainder of this answer discusses approaches that generalize.
It is practicable to compute the powers of $\mathbb{P}$ by brute force. For instance, the R
program below tabulates all the answers for $N \le 25$ and $L \le 10$ within 0.01 seconds. Obtaining all answers for $N\le 500$ for any given $L$ takes only a second--and almost all of that effort consists of a one-time calculation to determine the fastest way to compute the powers. It's not worth using double-precision floats for the computation when $N \gt 1000,$ because even when $L=2$ the values overflow.
Here's a piece of the output. The smaller parts agree with the tabulation in the question (and with my own independent tabulation).
l=2 l=3 l=4 l=5
n=4 1 4 9 16
n=5 5 32 99 224
n=6 17 172 729 2096
n=7 48 760 4410 16128
n=8 122 2996 23778 110672
n=9 290 10960 118854 704288
n=10 659 38076 563499 4251248
n=11 1451 127464 2570769 24686912
n=12 3123 415236 11395539 139221648
Code
In principle, one could write a program to produce the transition matrix $\mathbb P$ given any regular expression and alphabet length. That's too much to tackle here, so I just hard-coded the transition matrix in the function fsa
. Preliminary functions power.optimize
and matrix.power
provide a mechanism to compute matrix powers quickly: they are optimized for tabulating a large number of results involving relatively small powers of many different matrices. (The former can take considerable time for $N\gg 500,$ but that's a problem requiring extended-precision or exact arithmetic anyway.) The actual work, performed by the function f
, amounts to computing $\mathbb{P}$ (which depends on $L$), taking its $N^\text{th}$ power, and returning its $(1,5)$ entry.
#
# Find the fastest way to compute powers up to `n`.
# Returns a data structure useful for performing the computation.
#
power.optimize <- function(n) {
components <- as.list(1:n)
combine <- function(i, j) union(components[[i]], components[[j]])
if (n > 1) {
for (k in 2:n) {
cost <- sapply(1:(k-1), function(i) 1 + length(combine(i, k-i)))
i <- which.min(cost)
components[[k]] <- c(k, combine(k-i, i))
}
}
components
}
#
# Compute a positive integral power of a matrix.
#
matrix.power <- function(x, n, components) {
if (missing(components)) components <- power.optimize(n)
powers <- list()
powers[[1]] <- x
for (i in rev(setdiff(components[[n]], 1))) {
k <- components[[i]][2]
j <- components[[i]][1] - k
powers[[i]] <- powers[[j]] %*% powers[[k]]
}
powers[[n]]
}
#
# Create the transition matrix for the regular expression "^.*A[^A][^A]+A.*$"
# where "A" is from an alphabet of `L` letters.
#
fsa <- function(L) {
p <- 1/L
q <- 1-p
matrix(c(q,0,0,0,0,
p,p,p,0,0,
0,q,0,0,0,
0,0,q,q,0,
0,0,0,p,1), 5)
}
#
# Tabulate the results.
#
f <- Vectorize(function(N, L, components) {
x <- fsa(L)
x.N <- matrix.power(x, N, components)
L^N * x.N[1,5]
}, c("N", "L"))
N <- 4:12; names(N) <- paste0("n=", N)
L <- 2:5; names(L) <- paste0("l=", L)
components=power.optimize(max(N))
results <- outer(N, L, f, components=components)
results
More code
The following checks the answer via a complete enumeration of the non-matching strings. Its output, which is almost instantaneous, is the same as the previous.
It employs a simplification: since the regular expression distinguishes only between "A" and non-"A", the calculation can be carried out over a binary alphabet (representing these two classes of letters) where the letters are now chosen with probabilities $1/L$ and $1-1/L,$ respectively. Thus its time and space demands are only those of a binary alphabet rather than one with all $L$ letters. It can compute the full table (out to $N=25$ and $L=10$) in a minute or two.
This solution is very flexible: it can be applied to any regular expression (perhaps with minor modifications for more complex expressions).
#
# Enumeration.
#
enumerate <- Vectorize(function(N, L, prob, r="^.*A[^A][^A]+A.*$", alphabet=c("A","B")) {
if (missing(prob)) {
prob <- c(1,L-1)/L
alphabet <- alphabet[1:2]
}
non.match <- "" # The non-matching prefixes
p <- 1 # Their probabilities
for (i in 1:N) {
words <- c(outer(non.match, alphabet, paste0)) # Lengthen all non-matches
p <- c(outer(p, prob)) # Compute their probabilities
i <- !grepl(r, words) # Search for the pattern
non.match <- words[i] # Eliminate matches
p <- p[i] # Eliminate their probabilities
}
L^N * (1 - sum(p)) # Count the matches
}, c("N", "L"))
N <- 4:12; names(N) <- paste0("n=", N)
L <- 2:5; names(L) <- paste0("l=", L)
outer(N, L, enumerate)
References
Aho, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, First Edition.
Best Answer
Let $a_n$ be the number of strings of length $n$ with no tr's. An $(n-1)$-long string can be extended in $26$ ways unless it ends in "t", in which case it can only be extended in $25$ ways. So $$a_n= 26a_{n-1}-a_{n-2}$$ for $n\ge2$; the initial conditions are $a_0=1$ and $a_1=26$.
You can give a formula for $a_n$, but to compute $a_7$ it suffices to run out the recurrence. The final answer is $26^7-a_7 = 71,112,600$.
In case there's doubt, another way to verify the recurrence is to write down the $26\times26$ transition matrix for letters. All of the entries are $1$'s except for a single $0$ off the diagonal (corresponding to the prohibited tr). The characteristic polynomial is computed to be $z^{24}(z^2-26z+1)$.