I've been racking my brain for a couple of days to work out a series or closed-form equation to the following problem:
Specifically: given all strings of length $N$ that draws from an alphabet of $L$ letters (starting with 'A', for example {A, B}, {A, B, C}, …), how many of those strings contain a substring that matches the pattern: 'A', more than 1 not-'A', 'A'. The standard regular expression for that pattern would be A[^A][^A]+A
.
The number of possible strings is simple enough: $L^N$ . For small values of $N$ and $L$, it's also very practical to simply create all possible combinations and use a regular expression to find the substrings that match the pattern; in R:
all.combinations <- function(N, L) {
apply(
expand.grid(rep(list(LETTERS[1:L]), N)),
1,
paste,
collapse = ''
)
}
matching.pattern <- function(N, L, pattern = 'A[^A][^A]+A') {
sum(grepl(pattern, all.combinations(N, L)))
}
all.combinations(4, 2)
matching.pattern(4, 2)
I had come up with the following, which works for N < 7:
$$M(N, L) = \sum_{g=2}^{N-2} (N – g – 1) \cdot (L – 1)^g \cdot L^{(N – g -2)}$$
Unfortunately, that only works while N < 7 because it's simply adding the combinations that have substrings A..A, A…A, A….A, etc. and some combinations obviously have multiple matching substrings (e.g., A..A..A, A..A…A), which are counted twice.
For example: when N = 4 and L = 2 ({A, B}), there are $L^N = 16$ possible strings: AAAA, BAAA, ABAA, BBAA, AABA, BABA, ABBA, BBBA, AAAB, BAAB, ABAB, BBAB, AABB, BABB, ABBB, BBBB
. Only one of those strings matches the pattern 'A', more than 1 not-'A', 'A': ABBA
. If you are familiar with regular expression syntax, the pattern is expressed as A[^A][^A]+A
.
When N = 4, and L = 3 ({A, B, C}), there are $L^N = 81$ possible strings, and there are 4 strings that match the pattern: ABBA, ABCA, ACBA, ACCA
.
When N = 6, and L = 2, there are $L^N = 64$ combinations, and 17 strings that match the pattern: ABBAAA, AABBAA, BABBAA, ABBBAA, ABBABA, AAABBA, BAABBA, ABABBA, BBABBA, AABBBA, BABBBA, ABBBBA, ABBAAB, AABBAB, BABBAB, ABBBAB, ABBABB
Any suggestions?
For what it is worth, here's the number of combinations, and matching combinations for some values of N and L that are tractable to determine by generating all combinations and doing a regular expression match:
N L combinations matching
-- - ------------ --------
4 2 16 1
5 2 32 5
6 2 64 17
7 2 128 48
8 2 256 122
9 2 512 290
10 2 1024 659
4 3 81 4
5 3 243 32
6 3 729 172
7 3 2187 760
8 3 6561 2996
9 3 19683 10960
10 3 59049 38076
4 4 256 9
5 4 1024 99
6 4 4096 729
7 4 16384 4410
8 4 65536 23778
9 4 262144 118854
10 4 1048576 563499
Best Answer
There are many ways to solve this problem. The following one is of particular statistical interest. At the end it is verified with a brute-force enumeration which itself works remarkably well.
All regular expressions are implemented as Finite State Automata (FSA). The regular expression in the question,
A[^A][^A]+A
, can be a little more explicitly written in the formcorresponding to this FSA:
The letters of the word are fed to the FSA in order. Transitions are made along arrows labeled by the letters: "A" for an "A" and "A'" for any non-"A". The states can be described as
State $1$ is the initial state.
State $2$ occurs as soon as the first "A" is encountered.
State $3$ records observing a subsequent non-"A".
State $4$ records observing two successive non-"A" characters.
State $5$ is arrived at upon observing another "A." It is the terminal state.
We say the FSA "accepts" a word when its final state is $5.$
The number of words of length $N$ accepted by any FSA is the total of all possible words of length $N$ (namely, $L^N$) times the probability that a random word selected from all possible ones is accepted. Such a random word is constructed from a sequence of independent uniformly distributed letters. Feeding such a sequence to the FSA produces a Markov Chain.
The transition matrix $\mathbb P = (p_{ij})$ for a Markov chain gives the chance of making a transition from state $i$ to state $j.$ We can read these chances directly off the graph of the FSA, because, no matter what the state might be, the uniform selection of letters implies
The chance of observing an "A" is $1/L = p$ and
The chance of observing a non-"A" is $1-1/L = 1-p = q.$
Consequently, using the indexing of states as shown in the figure, the transition matrix is
$$\mathbb{P} = \pmatrix{ q & p & 0 & 0 & 0 \\ 0 & p & q & 0 & 0 \\ 0 & p & 0 & q & 0 \\ 0 & 0 & 0 & q & p \\ 0 & 0 & 0 & 0 & 1}$$
Standard techniques of linear algebra enable us to compute the effect of making $n$ transitions in the FSA, which (because all are independent) is given by the matrix $\mathbb{P}^n.$ Its $(1,5)$ entry is--by definition--the chance that the Markov Chain is in the terminal state ($5$) after $n$ transitions. Linear algebra tells us the formula will be a linear combination of $N^\text{th}$ powers of the eigenvalues of $\mathbb{P}$ whose coefficients are polynomials in $N.$
FWIW, for this particular FSA the formula is given by the following
R
functionf
; the variablese0
ande1
(as well as1
,l
, andl-1
) are directly proportional to those eigenvalues:(I found this through an elementary combinatorial analysis of the regular expression.) It produces the same output as the functions
f
andenumerate
presented below. However, this is a relatively uninteresting solution because it is specific to the particular regular expression of the question. The remainder of this answer discusses approaches that generalize.It is practicable to compute the powers of $\mathbb{P}$ by brute force. For instance, the
R
program below tabulates all the answers for $N \le 25$ and $L \le 10$ within 0.01 seconds. Obtaining all answers for $N\le 500$ for any given $L$ takes only a second--and almost all of that effort consists of a one-time calculation to determine the fastest way to compute the powers. It's not worth using double-precision floats for the computation when $N \gt 1000,$ because even when $L=2$ the values overflow.Here's a piece of the output. The smaller parts agree with the tabulation in the question (and with my own independent tabulation).
Code
In principle, one could write a program to produce the transition matrix $\mathbb P$ given any regular expression and alphabet length. That's too much to tackle here, so I just hard-coded the transition matrix in the function
fsa
. Preliminary functionspower.optimize
andmatrix.power
provide a mechanism to compute matrix powers quickly: they are optimized for tabulating a large number of results involving relatively small powers of many different matrices. (The former can take considerable time for $N\gg 500,$ but that's a problem requiring extended-precision or exact arithmetic anyway.) The actual work, performed by the functionf
, amounts to computing $\mathbb{P}$ (which depends on $L$), taking its $N^\text{th}$ power, and returning its $(1,5)$ entry.More code
The following checks the answer via a complete enumeration of the non-matching strings. Its output, which is almost instantaneous, is the same as the previous.
It employs a simplification: since the regular expression distinguishes only between "A" and non-"A", the calculation can be carried out over a binary alphabet (representing these two classes of letters) where the letters are now chosen with probabilities $1/L$ and $1-1/L,$ respectively. Thus its time and space demands are only those of a binary alphabet rather than one with all $L$ letters. It can compute the full table (out to $N=25$ and $L=10$) in a minute or two.
This solution is very flexible: it can be applied to any regular expression (perhaps with minor modifications for more complex expressions).
References
Aho, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, First Edition.