Edit: The indicator random variables have been radically changed, so that one gets a very quick computation of the mean and variance.
Let $Y$ be the number of reds drawn before the first blue. Suppose that the red balls have labels $1, 2, 3,\dots,r$. Let $X_i=1$ if red ball with label $i$ is drawn before the first blue is drawn, and let $X_i=0$ otherwise.
Then $Y=X_1+\cdots+X_r$. Note that the number of draws up to and including the first blue is $Y+1$. But $Y+1$ and $Y$ have the same variance.
To calculate the variance of $Y$, we first calculate its mean. By linearity of expectation we have
$$E(Y)=E(X_1)+E(X_2)+\cdots+E(X_r).$$
By symmetry, all the $E(X_i)$ are the same. The probability red with label $i$ comes before any of the $b$ blue is $\frac{1}{b+1}$. It follows that $E(Y)=\frac{r}{b+1}$.
To calculate the variance of $Y$, calculate $E[(X_1+\cdots +X_{r})^2]$ and subtract the square of $E(Y)$, which we know.
To find $E[(X_1+\cdots+X_r)^2]$, expand the square and use the linearity of expectation. We know the expectation of $\sum X_i^2$, since $X_i^2=X_i$. So we need the expectations of the cross terms.
For $i\ne j$, $X_iX_j=1$ if both red ball $i$ and red ball $j$ come before any blue. This has probability $\frac{2}{(b+2)(b+1)}$. Multiply by $2\binom{r}{2}$ to get the sum of the cross terms.
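Carrying this computation to its conclusion (a short completion of the step just described):
$$E(Y^2)=\frac{r}{b+1}+2\binom{r}{2}\frac{2}{(b+1)(b+2)}
=\frac{r}{b+1}+\frac{2r(r-1)}{(b+1)(b+2)},$$
so
$$\operatorname{Var}(Y)=\frac{r}{b+1}+\frac{2r(r-1)}{(b+1)(b+2)}-\frac{r^2}{(b+1)^2}
=\frac{rb\,(r+b+1)}{(b+1)^2(b+2)},$$
which matches the known variance of the negative hypergeometric distribution.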
We use the notation from the following MSE
link with $m$ for
the number of trials and $n$ for the number of different types of
coupons. We treat the special case where there are $j$ instances of
each type and we are sampling without replacement.
We ask about the probability of obtaining the distribution
$$\prod_{q=1}^n C_q^{\alpha_q}$$
where $\alpha_q$ says we have that many instances of type $C_q.$
We obtain
$$\frac{(nj-\sum_{q=1}^n \alpha_q)!}{(nj)!}
\prod_{q=1}^n \frac{j!}{(j-\alpha_q)!}.$$
Therefore the number of sequences of length $m$ of $n$
types of coupons drawn without replacement, with a maximum of $j$ coupons of
each type, is given by
$$m! [z^m]
\left(\sum_{k=0}^j \frac{j!}{(j-k)!}\frac{z^k}{k!}\right)^{n}
= m! [z^m] (1+z)^{nj} = {nj\choose m} \times m!.$$
We then obtain from first principles the formula
$$P[T=m] = \frac{1}{m!} {nj\choose m}^{-1} \times
n \times j \times (m-1)! [z^{m-1}]
\left(\sum_{k=1}^j \frac{j!}{(j-k)!}\frac{z^k}{k!}\right)^{n-1}
\\ = nj \times \frac{1}{m} {nj\choose m}^{-1}
[z^{m-1}] \left(-1 + (1+z)^j\right)^{n-1}.$$
This becomes
$$nj \times \frac{1}{m} {nj\choose m}^{-1}
[z^{m-1}] \sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} (1+z)^{qj}
\\ = {nj-1\choose m-1}^{-1}
\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} {qj\choose m-1}.$$
Observe that
$${qj\choose m-1} {nj-1\choose m-1}^{-1}
= \frac{(qj)! \times (nj-1-(m-1))!}{(qj-(m-1))! \times (nj-1)!}
\\ = {nj-1\choose qj}^{-1} {nj-1-(m-1)\choose qj-(m-1)}.$$
We record for the probabilities the formula
$$\bbox[5px,border:2px solid #00A000]{ P[T=m] =
\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q}
{nj-1\choose qj}^{-1} {nj-1-(m-1)\choose nj-1-qj}.}$$
Start by verifying that this is a probability distribution. We obtain
for the sum in $m$
$$\sum_{m=n}^{(n-1)j+1} {nj-1-(m-1)\choose nj-1-qj}
\\ = [z^{nj-1-qj}] \sum_{m=n}^{(n-1)j+1} (1+z)^{nj-1-(m-1)}
\\ = [z^{nj-1-qj}] \sum_{m=j-1}^{n(j-1)} (1+z)^m
= [z^{nj-qj}] ((1+z)^{n(j-1)+1} - (1+z)^{j-1}).$$
We have $nj-qj\ge j$ so only the first term contributes and we obtain
$$\sum_m P[T=m] =
\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} {nj-1\choose qj}^{-1}
{n(j-1)+1\choose nj-qj}
\\ = \frac{n(j-1)+1}{j}
\sum_{q=0}^{n-1} {n-1\choose q} \frac{(-1)^{n-1-q}}{n-q}
{nj-1\choose nj-qj-1}^{-1} {n(j-1)\choose nj-qj-1}$$
We get for the rightmost pair of binomial coefficients
$$\frac{(n(j-1))! \times (qj)!}{(nj-1)! \times (qj+1-n)!}
= {nj-1\choose n-1}^{-1} {qj\choose n-1}$$
which yields for the sum
$$\frac{n(j-1)+1}{j} {nj-1\choose n-1}^{-1}
\sum_{q=0}^{n-1} {n-1\choose q} \frac{(-1)^{n-1-q}}{n-q} {qj\choose n-1}
\\ = \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1}
\sum_{q=0}^{n-1} {n\choose q} (-1)^{n-1-q} {qj\choose n-1}
\\ = \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1} {nj\choose n-1}
\\ + \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1}
\sum_{q=0}^{n} {n\choose q} (-1)^{n-1-q} {qj\choose n-1}
\\ = \frac{n(j-1)+1}{nj}\frac{nj}{nj+1-n}
\\ + \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1}
[z^{n-1}] \sum_{q=0}^{n} {n\choose q} (-1)^{n-1-q} (1+z)^{qj}
\\ = 1 - \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1}
[z^{n-1}] (1-(1+z)^j)^n$$
Now observe that $[z^{n-1}] (1-(1+z)^j)^n = 0$, since the lowest-order
term of $1-(1+z)^j$ is $-jz$ and hence the power series starts at $z^n$.
Therefore everything simplifies to $1$ and we have a probability distribution.
Continuing with the expectation we have the following closed form:
$$\bbox[5px,border:2px solid #00A000]{ \mathrm{E}[T] =
\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q}
{nj-1\choose qj}^{-1}
\sum_{m=n}^{(n-1)j+1} m {nj-1-(m-1)\choose nj-1-qj}.}$$
Let us examine some of these values computed symbolically. Here are
the first few for eight types of coupons starting at $j=1:$
$$8,{\frac {76627}{6435}},{\frac {76801}{5434}},{\frac {7473667}{480675}},
{\frac {1318429}{79794}},\ldots$$
and here is the initial segment for ten types of coupons:
$$10,{\frac {707825}{46189}},{\frac {7008811}{380380}},
{\frac {266299459}{13042315}},{\frac {182251913}{8360638}},
{\frac {748880445829}{32831263465}},\ldots$$
Careful inspection of these values reveals that we cannot hope for
additional simplification when $j\ge 2$ because if it were possible it
would have appeared in these sample values. We do see however that the
case $j=1$ should be possible, the value being $n$ (we always finish
after $n$ draws if there is only one instance of each coupon).
We now do this calculation, which is trivial, but nonetheless a useful
sanity check, starting with
$$\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q}
{n-1\choose q}^{-1}
\sum_{m=n}^{n} m {n-1-(m-1)\choose n-1-q}
\\ = \sum_{q=0}^{n-1} (-1)^{n-1-q}
\times n {n-1-(n-1)\choose n-1-q}
\\ = (-1)^{n-1-(n-1)} \times n \times {0\choose n-1-(n-1)} = n.$$
It certainly seems like a worthwhile challenge to prove that the
closed form for $\mathrm{E}[T]$ tends to $n H_n$ in the limit
$j\to\infty$ (the with-replacement coupon collector value), which is
confirmed by the numerical evidence.
We did verify the formula for the expectation in software, as follows.
It really is quite remarkable that the output from this program is in
excellent agreement with the closed form on all values that were
tested.
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <time.h>
#include <string.h>

int main(int argc, char **argv)
{
    int n = 6, j = 3, trials = 1000;

    if(argc >= 2){
        n = atoi(argv[1]);
    }
    if(argc >= 3){
        j = atoi(argv[2]);
    }
    if(argc >= 4){
        trials = atoi(argv[3]);
    }

    assert(1 <= n);
    assert(1 <= j);
    assert(1 <= trials);

    srand48(time(NULL));

    long long data = 0;
    for(int tind = 0; tind < trials; tind++){
        /* pool of n*j coupons, j instances of each of n types */
        int src[n*j];
        for(int cind = 0; cind < n*j; cind++)
            src[cind] = cind/j;

        int seen = 0; int steps = 0;
        int dist[n];
        for(int cind = 0; cind < n; cind++)
            dist[cind] = 0;

        /* draw without replacement until all n types have been seen */
        while(seen < n){
            int cpidx = drand48() * (double)(n*j-steps);
            int coupon = src[cpidx];
            /* remove the drawn coupon by shifting the tail left */
            for(int cind=cpidx; cind < n*j-steps-1; cind++)
                src[cind] = src[cind+1];
            steps++;
            if(dist[coupon] == 0)
                seen++;
            dist[coupon]++;
        }
        data += steps;
    }

    long double expt = (long double)data/(long double)trials;
    printf("[n = %d, j = %d, trials = %d]: %Le\n",
           n, j, trials, expt);
    exit(0);
}
Best Answer
Mistakes in your analysis
To make things clear, let $N$ be the number of draws it takes to see all colors, so $E[N]$ is your quantity of interest.
There are two things wrong with this:
First, a unit mismatch. For $D_m$, the number of draws is fixed, and the number of distinct colors seen is random. This is not the same as having the number of colors be fixed and the number of draws to reach that target being random. Therefore, there is no direct relation between $D_m$ and $N$.
Your definition of $D_m$ still fits in a "without-replacement" paradigm. You could say that it is approximately with-replacement.
This calculation assumes drawing with replacement, which is only approximately correct. Approximations should always be notated with $\approx$.
The calculation still computes the approximation incorrectly. The probability that a color is present after $m$ draws with replacement would be $$ 1-(1-\tfrac1{\color{red}{n}})^m \approx 1-e^{-m/n} $$ Then, linearity of expectation would imply $$ E[D_m]\approx n(1-e^{-m/n}) $$ The probability without replacement is $$ 1-\binom{(n-1)k}{m}\Big/\binom{nk}{m}, $$ so the exact expression is $$ E[D_m]=n\left(1-\binom{(n-1)k}{m}\Big/\binom{nk}{m}\right) $$
The problem here is your language is imprecise. What does "in expectation we will draw every ball to see each type" mean? In order to find $E[N]$, you will need a different method.
For each $i\in \{1,\dots,n\}$, let $A_{i,m}$ be the event that you do not see color number $i$ in the first $m$ draws. By symmetry, $$ P(A_{1,m}\cap A_{2,m}\cap\cdots \cap A_{j,m})=\binom{(n-j)k}{m}\Big/\binom {nk}m $$
Now, letting $B_m$ be the event that some color is missing after $m$ draws, we can use the above result, along with the principle of inclusion exclusion, to conclude $$ P(B_m)=\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}\binom{(n-j)k}{m}\Big/\binom {nk}m $$ Warning: For this formula to work, we need to use the following convention: $\binom{a}{b}$ is defined to be zero when $b>a$ or $b$ is negative. This means we do not have to worry about summation bounds.
Finally, $$ E[N]=\sum_{m=0}^{\infty}P(N>m)=\sum_{m=0}^{nk} P(B_m)= \sum_{m=0}^{nk}\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}\binom{(n-j)k}{m}\Big/\binom {nk}m $$