The probability that a sequence of $M$ random letters contains a sub-sequence of $K$ letters

combinatoricsprobability

The alphabet is supposed to have $n$ characters (usually $n=26$, though in my case $n=256$ or $256^2$ or $256^4$).

Example:

Here $n=2$ and the alphabet consists of two letters A and B. The prespecified sub-sequence is ABB (so $K=3$). The sequence contains $M=4$ letters. The $M^n$ possibilities for the sequence are:

AAAA, AAAB, AABA, AABB, ABAA, ABAB, ABBA, ABBB, BAAA, BAAB, BABA, BABB, BBAA, BBAB, BBBA, and BBBB.

The sub-sequence ABB appears (highlighted in bold) 4 times out of 16, and in each of these four cases, it appears only once within the sequence. So the answer is 1/4 in this case. Note that for the sub-sequence AAA, the answer would be 3/16.

In my case, $M$ is large (say, $10^7$) and $K$ is small (say $K=300$.) Both the sequence and sub-sequence are random in my case.

Best Answer

As far as I know, you can compute the complement probability (i.e. the probability of the sub-sequence not appearing in the sequence) using the Goulden-Jackson cluster method. The original paper is this one, but you may be able to find more practical explanations online.