The specific word problem that motivated this question was:
Generate random numbers 0-9 uniformly. Define $W$ to be the number of trials required for at least one 4, at least one 5, and at least one 6 to show up.
For example, if the generated sequence is $444444440980981245550986$, then $W=24$.
More generally:
Question 1: Given a Multinoulli random variable (e.g. a fair die) with probability distributed uniformly between $K$ categories. Define $W$ to be the number of trials required for some subset $1 \le k \le K$ of the categories to all appear at least once. What is the distribution of $W$?
Possible solution: Calculating $\mathbb{P}(W \le n)$ corresponds (I think) to calculating
- all length $n$ sequences in which at least one of each category shows up (i.e. in $k$ of the trials), and then the sequences can do whatever they want in the remaining trials
and then dividing this number by
- the number of all possible length $n$ sequences.
So we can break this up into sub-problems as follows:
I think the first number is (according to the fundamental counting principle):
$$ \sum_{m=k}^n \sum_{x_1 + \dots + x_k = m \,, \\ \neg(x_1 = \dots = x_k = 0)} \binom{n}{x_1, x_2, \dots, x_k} (K – k)^{n-m} $$
and the second number is $K^n$, so is the probability just:
$$ \mathbb{P}(W \le n) \overset{?}{=} \sum_{m=k}^n \sum_{x_1 + \dots + x_k = m \,, \\ \neg(x_1 = \dots = x_k = 0)} \binom{n}{x_1, x_2, \dots, x_k} \left(\frac{K – k}{K} \right)^{n-m} \left(\frac{1}{K}\right)^m \,? $$
Question 2: If the above formula is correct, is there a way to simplify it?
It does appear to simplify to the geometric CDF for $p= \frac{1}{2}$ (take $K=2$ and $k=1$). A simplification which held more generally would be nice.
EDIT: It occurred to me that always $m >0$, since $m \ge k$, but $x_1 = \dots = x_k = 0$ would imply $x_1 + \dots + x_k = m = 0$, a contradiction. Therefore the second condition in the sum is redundant, so the above expression simplifies to:
$$\sum_{m=k}^n \sum_{x_1 + \dots + x_k = m } \binom{n}{x_1, x_2, \dots, x_k} \left(\frac{K – k}{K} \right)^{n-m} \left(\frac{1}{K}\right)^m \,, $$
which we can get a simplified closed form for if we can simplify
$$\sum_{x_1 + \dots + x_k = m } \binom{n}{x_1, x_2, \dots, x_k} \quad \text{for }k \le m \le n \,. $$
When $m=n$ the multinomial theorem gives us that this sum is equal to $k^n$. But I don’t know about more generally.
Progress so far: In the case that $k=1$, this reduces to a regular geometric distribution, since we can "clump" all of the remaining $K-1$ categories into one super-category.
However, for $k>1$, this trick no longer seems to work. This is because the events of any of the categories in the $k$ categories showing up do not seem to be independent. E.g. in the example above, if one generates a $4$, one also knows that one did not generate a $6$.
And it isn't sufficient to just calculate the distribution that any one of categories appears $k$ times, since one could have the same category appearing $k$ times without any of the categories of interest appearing at all, e.g. $444$ in the above example.
It seems like it might be easier to calculate the probabilities:
$$\mathbb{P}(W > n)\,, \quad n \ge k \,. $$
This would allow us to get the distribution, since:
$$F_W = \begin{cases} 0 & 0 \le n < k \\ 1 – \mathbb{P}(W > n) \end{cases} $$
and because $W$ is non-negative we even get the expectation for free:
$$\mathbb{E}[W] = \sum_{n=0}^{\infty} \mathbb{P}(W > n) \,. $$
The only thoughts I've had for calculating $\mathbb{P}(W =k )$ so far are:
- find a recurrence relation for $\mathbb{P}(W >n)$ in terms of $\mathbb{P}(W > n-1)$ (and possibly also $\mathbb{P}(W > n-2)$), and then try to derive a probability generating function from this somehow.
- Brute force calculate the number of sequences of interest using some combinatorial principle and then divide by $\left(\frac{1}{K}\right)^n$.
Best Answer
Let your sample space $X$ be the set of strings of length $n$ with elements from $K$ categories, so that $|X|=K^n$. Suppose $1 \ldots k$ are the categories defining the random variable $W$. Let $X_i$ be the subset of $X$ whose elements do not contain an element of category $i$, as $i$ varies from $1$ to $k$. Then the event that $W > n$ is exactly the subset of the sample space $\displaystyle \bigcup_{i=1}^{k} X_i$.
Using inclusion-exclusion, the size of this union is $$\displaystyle \sum_{j=1}^k (-1)^{j+1}{k \choose j}(K-j)^n$$
In this summation, $j$ represents the number of categories of $1 \ldots k$ that are not seen in the string. The $(-1)^{j+1}$ comes from the inclusion-exclusion, ${k \choose j}$ counts the number of ways to pick which of the $k$ categories are skipped, and $(K-j)^n$ counts the number of ways to assign each element of the string excepting the $j$ categories to be avoided.