Probability of a random generated string containing more than $m$ equal characters

combinationscombinatoricsprobability

For a randomly generated character sequence of length $k$ containing only characters from a fixed set of length $n$ (e.g. alphabet), what is the probability that it contains at least $m$ equal characters?

Original problem: What is the probability that a randomly generated password of length $k = 10$ only consisting of lowercase letters and digits ($n = 26 + 10 = 36$) will contain any character for at least $m = 5$ times? For example, the password aa91abcada contains the character 'a' exactly $5$ times.

I suppose that if we define random variable $X$ as the number of equal characters in a sequence of length n, then the problem boils down to computing:
$$P(X\ge m) = P(X=m) + P(X=m+1) + \cdots + P(X=k)$$
Thus we only need to find a formula for computing $P(X=m)$ for $m\in\lbrace1, 2,\ldots,k\rbrace$.

I managed to come up with the following formula:
$$P(X=m) = \frac{\text{number of satisfying sequences}}{\text{ number of all possible sequences}} = \frac{n\binom{(m + 1)(k-m)}{(k-m)}(n-1)^{k-m}}{n^k}$$

where the 3 multipliers in the numenator have the following meanings:

  • we choose a fixed character that repeats m times (we do this for all $n$ characters)
  • we can place the remaining $k – m$ characters between any of the fixed characters, at the beginning, or at the end. Thus we want to pick $k – m$ positions out of all possible $(m + 1)(k – m)$ positions (there are $m + 1$ "spaces" between fixed characters and in each one of them there can possibly be $k – m$ characters).
  • each of the remaining $(k – m)$ characters can be any of the remaining $n – 1$ characters in our alphabet

However, I have a strong suspicion that this formula (if correct) works only for for $m > \frac{k}{2}$.

Is the above formula correct? If not, is there a general formula for this kind of problem?

Best Answer

Okay, here's my attempt. I'm not 100% sure of this either, but maybe it can offer a new perspective. Instead of computing the sums of all $P(X=m)+P(X=m+1)...$, you can directly compute the probability for $X\geq m$.

Using the case provided in the question with $k=10$, $n=36$, and $m=5$, we can create the following string that uses $a$ to represent the fixed character and $b$ to represent all other characters: $$aaaaabbbbb$$ In this scenario, there are $m=5$ $a$'s which have one possible value. On the other hand, the $k-m=10-5=5$ $b$'s can be all 36 different characters. We don't need to exclude the fixed character from $b$ because we're looking for $P(X\geq m)$ and not $P(X=m)$. Hence, the number of outcomes that fulfill the parameters given for that single string is: $$1\times 1\times 1\times 1\times 1\times 36\times 36\times 36\times 36\times 36=36^5=n^{k-m}$$ Next, we multiply by the ${k\choose m}={10\choose5}=252$ different arrangements of strings with 5 $a$'s and 5 $b$'s: $${36^5{10\choose 5}}=n^{k-m} {k\choose m}$$ Finally, there are 36 possible characters that the fixed character can be, hence, we multiply by 36: $$36^{5+1}{10\choose 5}=n^{k-m+1}{k\choose m}$$ Now divide the satisfying outcomes over the total possible outcomes to get the final probability: $$P(X\geq 5)=\frac{36^{5+1}{10\choose 5}}{36^{10}}\approx0.015\%$$ As a general rule: $$P(X\geq m)=\frac{n^{k-m+1}{k\choose m}}{n^k}$$ And then for fun, we can make a general rule for $P(X=m)$ by excluding the fixed character from the $b$ letters: $$P(X=m)=\frac{n(n-1)^{k-m}{k\choose m}}{n^k}$$ Again, I'm not 100% confident in this solution but it seems to make sense to me.

Related Question