Combinatorics – Counting All Possibilities That Contain a Substring

combinatoricspermutations

How many strings are there of seven lowercase letters that have the substring tr in them?

So I am having a little problem with this question, I know that the total number of combinations is $26^6$ but there is double counting on some of the combinations.

For example, when you have a case that contains multiples 'tr' then it will be counted multiple times depending on the location of tr, even though its the same string.

t r t r t r a

Any advice on what to subtract to remove this double counting?

Thanks for any help!

Best Answer

Let $a_n$ be the number of strings of length $n$ with no tr's. An $(n-1)$-long string can be extended in $26$ ways unless it ends in "t", in which case it can only be extended in $25$ ways. So $$a_n= 26a_{n-1}-a_{n-2}$$ for $n\ge2$; the initial conditions are $a_0=1$ and $a_1=26$.

You can give a formula for $a_n$, but to compute $a_7$ it suffices to run out the recurrence. The final answer is $26^7-a_7 = 71,112,600$.

In case there's doubt, another way to verify the recurrence is to write down the $26\times26$ transition matrix for letters. All of the entries are $1$'s except for a single $0$ off the diagonal (corresponding to the prohibited tr). The characteristic polynomial is computed to be $z^{24}(z^2-26z+1)$.

Related Solutions

[Math] How to determine the number of possible combinations of letters that contain a degenerate substring

There are many ways to solve this problem. The following one is of particular statistical interest. At the end it is verified with a brute-force enumeration which itself works remarkably well.

All regular expressions are implemented as Finite State Automata (FSA). The regular expression in the question, A[^A][^A]+A, can be a little more explicitly written in the form

^.*A[^A][^A]+A.*$

corresponding to this FSA:

The letters of the word are fed to the FSA in order. Transitions are made along arrows labeled by the letters: "A" for an "A" and "A'" for any non-"A". The states can be described as

State $1$ is the initial state.
State $2$ occurs as soon as the first "A" is encountered.
State $3$ records observing a subsequent non-"A".
State $4$ records observing two successive non-"A" characters.
State $5$ is arrived at upon observing another "A." It is the terminal state.

We say the FSA "accepts" a word when its final state is $5.$

The number of words of length $N$ accepted by any FSA is the total of all possible words of length $N$ (namely, $L^N$) times the probability that a random word selected from all possible ones is accepted. Such a random word is constructed from a sequence of independent uniformly distributed letters. Feeding such a sequence to the FSA produces a Markov Chain.

The transition matrix $\mathbb P = (p_{ij})$ for a Markov chain gives the chance of making a transition from state $i$ to state $j.$ We can read these chances directly off the graph of the FSA, because, no matter what the state might be, the uniform selection of letters implies

The chance of observing an "A" is $1/L = p$ and
The chance of observing a non-"A" is $1-1/L = 1-p = q.$

Consequently, using the indexing of states as shown in the figure, the transition matrix is

$$\mathbb{P} = \pmatrix{ q & p & 0 & 0 & 0 \\ 0 & p & q & 0 & 0 \\ 0 & p & 0 & q & 0 \\ 0 & 0 & 0 & q & p \\ 0 & 0 & 0 & 0 & 1}$$

Standard techniques of linear algebra enable us to compute the effect of making $n$ transitions in the FSA, which (because all are independent) is given by the matrix $\mathbb{P}^n.$ Its $(1,5)$ entry is--by definition--the chance that the Markov Chain is in the terminal state ($5$) after $n$ transitions. Linear algebra tells us the formula will be a linear combination of $N^\text{th}$ powers of the eigenvalues of $\mathbb{P}$ whose coefficients are polynomials in $N.$

FWIW, for this particular FSA the formula is given by the following R function f; the variables e0 and e1 (as well as 1, l, and l-1) are directly proportional to those eigenvalues:

f <- Vectorize(function(n, l) {
  if (l==3) {
    (2 * (-1)^n - (56 + 15*n + 9*n^2) * 2^n + 54 * 3^n) / 54
  } else {
    delta <- sqrt(4*l - 3)
    e0 <- (1 - delta)/2
    e1 <- (1 + delta)/2
    (((e1^n - e0^n) * (3 + 2*l))/(2 * (e0 - e1)) - 3/2 * (e1^n + e0^n) + 
        (l - 3)^2 * l^n - (l - 1)^n * (6 - 3*n + l * (l + n - 6))) / (l - 3)^2
  }
})

(I found this through an elementary combinatorial analysis of the regular expression.) It produces the same output as the functions f and enumerate presented below. However, this is a relatively uninteresting solution because it is specific to the particular regular expression of the question. The remainder of this answer discusses approaches that generalize.

It is practicable to compute the powers of $\mathbb{P}$ by brute force. For instance, the R program below tabulates all the answers for $N \le 25$ and $L \le 10$ within 0.01 seconds. Obtaining all answers for $N\le 500$ for any given $L$ takes only a second--and almost all of that effort consists of a one-time calculation to determine the fastest way to compute the powers. It's not worth using double-precision floats for the computation when $N \gt 1000,$ because even when $L=2$ the values overflow.

Here's a piece of the output. The smaller parts agree with the tabulation in the question (and with my own independent tabulation).

      l=2    l=3      l=4       l=5
n=4     1      4        9        16
n=5     5     32       99       224
n=6    17    172      729      2096
n=7    48    760     4410     16128
n=8   122   2996    23778    110672
n=9   290  10960   118854    704288
n=10  659  38076   563499   4251248
n=11 1451 127464  2570769  24686912
n=12 3123 415236 11395539 139221648

Code

In principle, one could write a program to produce the transition matrix $\mathbb P$ given any regular expression and alphabet length. That's too much to tackle here, so I just hard-coded the transition matrix in the function fsa. Preliminary functions power.optimize and matrix.power provide a mechanism to compute matrix powers quickly: they are optimized for tabulating a large number of results involving relatively small powers of many different matrices. (The former can take considerable time for $N\gg 500,$ but that's a problem requiring extended-precision or exact arithmetic anyway.) The actual work, performed by the function f, amounts to computing $\mathbb{P}$ (which depends on $L$), taking its $N^\text{th}$ power, and returning its $(1,5)$ entry.

#
# Find the fastest way to compute powers up to `n`.
# Returns a data structure useful for performing the computation.
#
power.optimize <- function(n) {
  components <- as.list(1:n)
  combine <- function(i, j) union(components[[i]], components[[j]])

  if (n > 1) {
    for (k in 2:n) {
      cost <- sapply(1:(k-1), function(i) 1 + length(combine(i, k-i)))
      i <- which.min(cost)
      components[[k]] <- c(k, combine(k-i, i))
    }
  }
  components
}
#
# Compute a positive integral power of a matrix.
#
matrix.power <- function(x, n, components) {
  if (missing(components)) components <- power.optimize(n)

  powers <- list()
  powers[[1]] <- x
  for (i in rev(setdiff(components[[n]], 1))) {
    k <- components[[i]][2]
    j <- components[[i]][1] - k
    powers[[i]] <- powers[[j]] %*% powers[[k]]
  }

  powers[[n]]
}
#
# Create the transition matrix for the regular expression "^.*A[^A][^A]+A.*$"
# where "A" is from an alphabet of `L` letters.
#
fsa <- function(L) {
  p <- 1/L
  q <- 1-p
  matrix(c(q,0,0,0,0,
           p,p,p,0,0,
           0,q,0,0,0,
           0,0,q,q,0,
           0,0,0,p,1), 5)
}
#
# Tabulate the results.
#
f <- Vectorize(function(N, L, components) {
  x <- fsa(L)
  x.N <- matrix.power(x, N, components)
  L^N * x.N[1,5]
}, c("N", "L"))

N <- 4:12; names(N) <- paste0("n=", N)
L <- 2:5; names(L) <- paste0("l=", L)
components=power.optimize(max(N))
results <- outer(N, L, f, components=components)
results

More code

The following checks the answer via a complete enumeration of the non-matching strings. Its output, which is almost instantaneous, is the same as the previous.

It employs a simplification: since the regular expression distinguishes only between "A" and non-"A", the calculation can be carried out over a binary alphabet (representing these two classes of letters) where the letters are now chosen with probabilities $1/L$ and $1-1/L,$ respectively. Thus its time and space demands are only those of a binary alphabet rather than one with all $L$ letters. It can compute the full table (out to $N=25$ and $L=10$) in a minute or two.

This solution is very flexible: it can be applied to any regular expression (perhaps with minor modifications for more complex expressions).

#
# Enumeration.
#
enumerate <- Vectorize(function(N, L, prob, r="^.*A[^A][^A]+A.*$", alphabet=c("A","B")) {
  if (missing(prob)) {
    prob <- c(1,L-1)/L
    alphabet <- alphabet[1:2]
  }
  non.match <- "" # The non-matching prefixes
  p <- 1          # Their probabilities
  for (i in 1:N) {
    words <- c(outer(non.match, alphabet, paste0))   # Lengthen all non-matches
    p <- c(outer(p, prob))                           # Compute their probabilities
    i <- !grepl(r, words)                            # Search for the pattern
    non.match <- words[i]                            # Eliminate matches
    p <- p[i]                                        # Eliminate their probabilities
  }
  L^N  * (1 - sum(p))                                # Count the matches
}, c("N", "L"))


N <- 4:12; names(N) <- paste0("n=", N)
L <- 2:5; names(L) <- paste0("l=", L)
outer(N, L, enumerate)

References

Aho, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, First Edition.

Number of 6-character passwords containing at least one number – why is the answer wrong

For example, the string 11AAAA is counted twice in your solution, once when your "minimum guaranteed digit" is the first one, and once -- when it is the second one...