What is the probability of a word n characters long appearing in a string of m characters, in an alphabet of x characters? A word here is simply a string of characters contained in another string of characters. A string of characters of length L is an ordered L-tuple of characters, where characters are members of a set A called an alphabet. An alphabet contains n characters if it has n members that are all mutually not equal.IMPORTANT: If the alphabet in question has x characters, each character in a string has exactly 1/x probability of being a given character.
[Math] probability of a word in a string
combinatoricsprobabilitypuzzle
Related Solutions
Okay, here's my attempt. I'm not 100% sure of this either, but maybe it can offer a new perspective. Instead of computing the sums of all $P(X=m)+P(X=m+1)...$, you can directly compute the probability for $X\geq m$.
Using the case provided in the question with $k=10$, $n=36$, and $m=5$, we can create the following string that uses $a$ to represent the fixed character and $b$ to represent all other characters: $$aaaaabbbbb$$ In this scenario, there are $m=5$ $a$'s which have one possible value. On the other hand, the $k-m=10-5=5$ $b$'s can be all 36 different characters. We don't need to exclude the fixed character from $b$ because we're looking for $P(X\geq m)$ and not $P(X=m)$. Hence, the number of outcomes that fulfill the parameters given for that single string is: $$1\times 1\times 1\times 1\times 1\times 36\times 36\times 36\times 36\times 36=36^5=n^{k-m}$$ Next, we multiply by the ${k\choose m}={10\choose5}=252$ different arrangements of strings with 5 $a$'s and 5 $b$'s: $${36^5{10\choose 5}}=n^{k-m} {k\choose m}$$ Finally, there are 36 possible characters that the fixed character can be, hence, we multiply by 36: $$36^{5+1}{10\choose 5}=n^{k-m+1}{k\choose m}$$ Now divide the satisfying outcomes over the total possible outcomes to get the final probability: $$P(X\geq 5)=\frac{36^{5+1}{10\choose 5}}{36^{10}}\approx0.015\%$$ As a general rule: $$P(X\geq m)=\frac{n^{k-m+1}{k\choose m}}{n^k}$$ And then for fun, we can make a general rule for $P(X=m)$ by excluding the fixed character from the $b$ letters: $$P(X=m)=\frac{n(n-1)^{k-m}{k\choose m}}{n^k}$$ Again, I'm not 100% confident in this solution but it seems to make sense to me.
This is a long-winded comment that is being posted as an answer only for legibility.
The only approach that I am aware of is Inclusion-Exclusion which results in very ugly math.
I will make the simplying assumption that the word is composed of 5 distinct characters. My approach will be
$$\frac{N\text{(umerator)}}{D\text{(enominator)}}$$
where
$$D = (26)^{(1800)}.$$
There are $(1801 - 5) = 1796$ slots that the word can begin in. For each slot, there are $(1800 - 5) = 1795$ unconstrained letters, which can be chosen in $(26)^{(1795)}$ ways.
Therefore, I will set
$$N_1 = (1801 - 5) \times (26)^{(1800 - 5)}.$$
The eventual answer will be something like
$$N = N_1 - N_2 + N_3 - N_4 + \cdots.$$
To compute $N_2$, it would be wrong to presume that the computation is
$$N_2 = (1801 - [1 \times 5]) \times (1801 - [2 \times 5]) \times (26)^{(1800 - [2 \times 5])}.$$
The reason that the above approach is wrong is that it presumes that regardless of the placement of one of the words, there will automatically be 1791 possible locations for the start of the 2nd word.
If the first word happens to start in position 1 or position 1796, then yes there will be 1791 possible locations for the 2nd word. However, as the first letter of the first word changes location from 1 to 2 to 3 to 4 to 5, the possible locations for the start of the second word decrease by 1.
For example, if the first word begins on position 5, then the second word can not begin before position 10. Note that by the simplifying assumption that the 5 letters in the word are distinct, the two words can not overlap.
If the first letter of the first word is located anywhere between positions 5 and 1792 inclusive, (1788 positions) then there will be (1796 - 9) = 1787 possible starting points for the second word.
If the first word starts in position $i$, for $i \in \{1,2,3,4\}$ then there will be $(1792 - i)$ possible starting positions for the second word. By symmetry, similar considerations apply if the first word starts on or after position 1793.
Let
$$T_2 = \left[2 \times (1788 + 1789 + 1790 + 1791)\right] ~+~ \left[1788 \times 1787 \right].$$
Then the correct computation is
$$N_2 = T_2 \times (26)^{(1800 - [2 \times 5])}.$$
The math for $N_3$ will get uglier still than the math for $N_2$. This is because in addition to being concerned about the starting position of the leftmost word and the rightmost word, you also have to be concerned about the gap between two of the words. This will also affect how many positions the third word can start in.
Similarly when considering $N_4, N_5, \cdots$ you have to be concerned about each of the gaps between words.
Told you the math was ugly.
If somebody has a better way of doing this that in no way involves a computer algorithm, I'd like to know about it.
Best Answer
The answer is it depends in the sequence. Suppose we want to find the probability that the word 11 is inside a 3-letter word in an alphabet of 3 letters.There are 5 such words that contain it. Now lets look at the probability the word 12 is inside a 3-letter word. there are 6 words that contain it. Not all words have the same probability.
Now for a fixed word:
let w be your m-letter string in a k letter alphabet. Let $x_n$ be the number of words of lenth $n$ (also in x letter alphabet) containing w.
At first sight we would consider the following recurrence relation: $x_n=kx_{n-1}+k^{n-m}-x_{n-m}$ Since every word of length n can be created in the following way: for a n-1 word that contains w just add any of the k letters at the end $kx_{n-1}$. For a $k-n$length word just add w at the end to get one that does work. This seems really great. There's just one problem:
Sometimes when you add w at the end of a n-k word that doesn't contain w you get a word that contains w even if you remove some of the last digits. Let me explain: for example consider $w=1,2,1$ then the word $3,1,2$ clearly doesn't contain $w$ so we add w to this letter to get $3,1,2,1,2,1$ notice that this word is obtained through both the process of taking an (n-1) word that contains w and taking and n-k word that does not. Thus we are counting it twice? Can you refine the recursion so it works?