How can one calculate the distribution of this “multinomial” analog of the geometric distribution

combinatoricsmultinomial-coefficientsprobabilityproof-verification

The specific word problem that motivated this question was:

Generate random numbers 0-9 uniformly. Define $W$ to be the number of trials required for at least one 4, at least one 5, and at least one 6 to show up.

For example, if the generated sequence is $444444440980981245550986$, then $W=24$.

More generally:

Question 1: Given a Multinoulli random variable (e.g. a fair die) with probability distributed uniformly between $K$ categories. Define $W$ to be the number of trials required for some subset $1 \le k \le K$ of the categories to all appear at least once. What is the distribution of $W$?

Possible solution: Calculating $\mathbb{P}(W \le n)$ corresponds (I think) to calculating

  • all length $n$ sequences in which at least one of each category shows up (i.e. in $k$ of the trials), and then the sequences can do whatever they want in the remaining trials

and then dividing this number by

  • the number of all possible length $n$ sequences.

So we can break this up into sub-problems as follows:

I think the first number is (according to the fundamental counting principle):

$$ \sum_{m=k}^n \sum_{x_1 + \dots + x_k = m \,, \\ \neg(x_1 = \dots = x_k = 0)} \binom{n}{x_1, x_2, \dots, x_k} (K – k)^{n-m} $$

and the second number is $K^n$, so is the probability just:

$$ \mathbb{P}(W \le n) \overset{?}{=} \sum_{m=k}^n \sum_{x_1 + \dots + x_k = m \,, \\ \neg(x_1 = \dots = x_k = 0)} \binom{n}{x_1, x_2, \dots, x_k} \left(\frac{K – k}{K} \right)^{n-m} \left(\frac{1}{K}\right)^m \,? $$

Question 2: If the above formula is correct, is there a way to simplify it?

It does appear to simplify to the geometric CDF for $p= \frac{1}{2}$ (take $K=2$ and $k=1$). A simplification which held more generally would be nice.

EDIT: It occurred to me that always $m >0$, since $m \ge k$, but $x_1 = \dots = x_k = 0$ would imply $x_1 + \dots + x_k = m = 0$, a contradiction. Therefore the second condition in the sum is redundant, so the above expression simplifies to:

$$\sum_{m=k}^n \sum_{x_1 + \dots + x_k = m } \binom{n}{x_1, x_2, \dots, x_k} \left(\frac{K – k}{K} \right)^{n-m} \left(\frac{1}{K}\right)^m \,, $$

which we can get a simplified closed form for if we can simplify

$$\sum_{x_1 + \dots + x_k = m } \binom{n}{x_1, x_2, \dots, x_k} \quad \text{for }k \le m \le n \,. $$

When $m=n$ the multinomial theorem gives us that this sum is equal to $k^n$. But I don’t know about more generally.

Progress so far: In the case that $k=1$, this reduces to a regular geometric distribution, since we can "clump" all of the remaining $K-1$ categories into one super-category.

However, for $k>1$, this trick no longer seems to work. This is because the events of any of the categories in the $k$ categories showing up do not seem to be independent. E.g. in the example above, if one generates a $4$, one also knows that one did not generate a $6$.

And it isn't sufficient to just calculate the distribution that any one of categories appears $k$ times, since one could have the same category appearing $k$ times without any of the categories of interest appearing at all, e.g. $444$ in the above example.

It seems like it might be easier to calculate the probabilities:

$$\mathbb{P}(W > n)\,, \quad n \ge k \,. $$

This would allow us to get the distribution, since:

$$F_W = \begin{cases} 0 & 0 \le n < k \\ 1 – \mathbb{P}(W > n) \end{cases} $$

and because $W$ is non-negative we even get the expectation for free:

$$\mathbb{E}[W] = \sum_{n=0}^{\infty} \mathbb{P}(W > n) \,. $$

The only thoughts I've had for calculating $\mathbb{P}(W =k )$ so far are:

  • find a recurrence relation for $\mathbb{P}(W >n)$ in terms of $\mathbb{P}(W > n-1)$ (and possibly also $\mathbb{P}(W > n-2)$), and then try to derive a probability generating function from this somehow.
  • Brute force calculate the number of sequences of interest using some combinatorial principle and then divide by $\left(\frac{1}{K}\right)^n$.

Best Answer

Let your sample space $X$ be the set of strings of length $n$ with elements from $K$ categories, so that $|X|=K^n$. Suppose $1 \ldots k$ are the categories defining the random variable $W$. Let $X_i$ be the subset of $X$ whose elements do not contain an element of category $i$, as $i$ varies from $1$ to $k$. Then the event that $W > n$ is exactly the subset of the sample space $\displaystyle \bigcup_{i=1}^{k} X_i$.

Using inclusion-exclusion, the size of this union is $$\displaystyle \sum_{j=1}^k (-1)^{j+1}{k \choose j}(K-j)^n$$

In this summation, $j$ represents the number of categories of $1 \ldots k$ that are not seen in the string. The $(-1)^{j+1}$ comes from the inclusion-exclusion, ${k \choose j}$ counts the number of ways to pick which of the $k$ categories are skipped, and $(K-j)^n$ counts the number of ways to assign each element of the string excepting the $j$ categories to be avoided.

Related Question