Find probability of 5 letter word appearing in 1800 character long random string

probability

So say there is 26 possible characters and you've got 1800 long random string. Now how would you go about finding the chance of there being a specific 5 letter word within?

I already found a related question to this but the given formula in the answer doesn't seem to make sense: Probability of a four letter word from a sequence of n random letters

According to the formula in the accepted answer, the denominator would get really big 26^1800 with 1800 length string and mean very little probability. But shouldn't probability increase when the random string length is increased instead of decrease?

Best Answer

This is a long-winded comment that is being posted as an answer only for legibility.

The only approach that I am aware of is Inclusion-Exclusion which results in very ugly math.

I will make the simplying assumption that the word is composed of 5 distinct characters. My approach will be

$$\frac{N\text{(umerator)}}{D\text{(enominator)}}$$

where

$$D = (26)^{(1800)}.$$

There are $(1801 - 5) = 1796$ slots that the word can begin in. For each slot, there are $(1800 - 5) = 1795$ unconstrained letters, which can be chosen in $(26)^{(1795)}$ ways.

Therefore, I will set

$$N_1 = (1801 - 5) \times (26)^{(1800 - 5)}.$$

The eventual answer will be something like

$$N = N_1 - N_2 + N_3 - N_4 + \cdots.$$

To compute $N_2$, it would be wrong to presume that the computation is

$$N_2 = (1801 - [1 \times 5]) \times (1801 - [2 \times 5]) \times (26)^{(1800 - [2 \times 5])}.$$

The reason that the above approach is wrong is that it presumes that regardless of the placement of one of the words, there will automatically be 1791 possible locations for the start of the 2nd word.

If the first word happens to start in position 1 or position 1796, then yes there will be 1791 possible locations for the 2nd word. However, as the first letter of the first word changes location from 1 to 2 to 3 to 4 to 5, the possible locations for the start of the second word decrease by 1.

For example, if the first word begins on position 5, then the second word can not begin before position 10. Note that by the simplifying assumption that the 5 letters in the word are distinct, the two words can not overlap.

If the first letter of the first word is located anywhere between positions 5 and 1792 inclusive, (1788 positions) then there will be (1796 - 9) = 1787 possible starting points for the second word.

If the first word starts in position $i$, for $i \in \{1,2,3,4\}$ then there will be $(1792 - i)$ possible starting positions for the second word. By symmetry, similar considerations apply if the first word starts on or after position 1793.

Let

$$T_2 = \left[2 \times (1788 + 1789 + 1790 + 1791)\right] ~+~ \left[1788 \times 1787 \right].$$

Then the correct computation is

$$N_2 = T_2 \times (26)^{(1800 - [2 \times 5])}.$$

The math for $N_3$ will get uglier still than the math for $N_2$. This is because in addition to being concerned about the starting position of the leftmost word and the rightmost word, you also have to be concerned about the gap between two of the words. This will also affect how many positions the third word can start in.

Similarly when considering $N_4, N_5, \cdots$ you have to be concerned about each of the gaps between words.

Told you the math was ugly.

If somebody has a better way of doing this that in no way involves a computer algorithm, I'd like to know about it.

Related Question