Probability – Probability of a Four Letter Word from a Sequence of n Random Letters

probability

I know this is easy, but my high school maths has failed me.

Question: I generate an 8 letter random string. What is the probability that within these 8 letters I will find a particular 4 letter word?

Each letter is A-Z. Repeats are allowed. What are the chances my string will contain the word "ABCD" for example?

EDIT: To clarify.. I do care about ordering. I want to know if "ABCD" occurs within my randomly generated string. But I don't care if "ABDC" occurs. In other words I want to know the probability of A followed by B followed by C followed by D occurring.

If it helps to understand why I am asking this. Our software generates 8 letter random strings which are used for logins to a web site. Very occasionally, these 8 letter random strings contain swear words. I want to know how to calculate the probability of a particular swear word occurring in a randomly generated 8 letter string.

Best Answer

Marc's answer is almost correct, but for one point; the numerator counts the occurrence of ABCDABCD twice. (As Graham Kemp pointed out. )


To elaborate: if I let $n(i)$ be the number of string patterns with ABCD starting at the $i$-th letter, then

$n(1) = n(2) = n(3) = n(4) = n(5) = 26^4$

(and $n(6) = n(7) = n(8) = 0$ )

but as ABCDABCD is counted in both $n(1)$ and $n(5)$, we must subtract by 1 to compensate.

So the end result would be

$$\frac{\left(\sum\limits_{i=1}^8{n(i)}\right)-1}{26^8}=\frac{5\times26^4-1}{26^8}$$ which is almost, but not quite, equal to the aforementioned answer. (Albeit probably close enough for realistic purposes...)


On the other hand, if we were looking for the expected value of the occurrence of ABCD, then we shouldn't need to subtract to compensate. (I think)


For longer strings, we would need to subtract duplicates, then add back to compensate for triplets, and so on, and the general formula I'm not sure how to write...

Related Question