1956 Miklos Schweitzer Counting – Problem 10

Tags: combinatorics, contest-math, probability, probability-distributions, probability-theory

Source: 1956 Miklos Schweitzer Contest (a Hungarian undergraduate open-note math contest), Problem 10

In an urn there are balls of $N$ different colours, $n$ balls of each colour. Balls are drawn without replacement until one of the colours turns up twice; denote by $V_{N,n}$ the number of balls drawn and by $M_{N,n}$ the expectation of the random variable $V_{N,n}$. Find the limit distribution of the random variable $\frac{V_{N,n}}{M_{N,n}}$ as $N \to \infty$ with $n$ a fixed number.

Attempt (wrong): My thought process: 1. find the distribution of $V_{N,n}$; 2. compute $\mathbb{E}[V_{N,n}]$; 3. apply a change of variables to get the distribution of $Y=\frac{V_{N,n}}{\mathbb{E}[V_{N,n}]}$; 4. justify the limiting distribution (presumably not too bad if we fix $n$).

  1. We can justify the pmf by counting. The event $\{V_{N,n}=k\}$ occurs if we select $k-1$ colours from the $N$ possibilities, in $\binom{N}{k-1}$ ways; select one ball of each of those colours for the first $k-1$ draws, in $\binom{n}{1}^{k-1}=n^{k-1}$ ways; and finally select the last ball to be of one of the $k-1$ colours already drawn, in $\binom{k-1}{1}=k-1$ ways. In summary, we have for $k\in\{2,\dots,N+1\}$ (a numerical check of this formula is sketched after this list)

$$\mathbb{P}\left(V_{N,n}=k\right) = \frac{\binom{N}{k-1}n^{k-1}(k-1)}{\binom{Nn}{k}}.$$

  2. From here, I thought I might apply some combinatorial identity (hockey-stick, Pascal's rule, the binomial theorem) or generating functions to compute the expectation:
    $$\mathbb{E}\left[V_{N,n}\right]=\sum\limits_{k=2}^{N+1} \frac{\binom{N}{k-1}n^{k-1}(k-1)}{\binom{Nn}{k}}\cdot k.$$
    However, I have no idea how to deal with the denominator – I assume I could apply Stirling's formula to get an approximation, but is there something exact here? (See the sketch after this list for a numerical probe.)

  3. Even though step 2 failed, I tried to continue – by a change of variables,
    $$\mathbb{P}\left(Y=y\right)=\mathbb{P}\left(V_{N,n}=\mathbb{E}[V_{N,n}]y\right)\cdot\mathbb{E}[V_{N,n}]=\frac{\binom{N}{\mathbb{E}[V_{N,n}]y-1}n^{\mathbb{E}[V_{N,n}]y-1}(\mathbb{E}[V_{N,n}]y-1)}{\binom{Nn}{\mathbb{E}[V_{N,n}]y}}\cdot \mathbb{E}[V_{N,n}].$$
    Obviously I cannot obtain a closed form without finding the expectation; my guess was that $\mathbb{E}[V_{N,n}]\propto N$ and $\mathbb{E}[V_{N,n}]\propto n^{-1}$. Thus as $N\rightarrow\infty$… never mind, I don't think anything can be deduced from this.
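To sanity-check steps 1 and 2 numerically, here is a small Python sketch (the helper names `draw_until_repeat`, `conjectured_pmf`, and `exact_E` are mine, not part of the problem). It simulates the urn, compares the conjectured pmf against empirical frequencies, and evaluates $\mathbb{E}[V_{N,n}]$ exactly via the tail-sum identity $\mathbb{E}[V]=\sum_{k\ge 0}\mathbb{P}(V>k)$, using the fact that $\mathbb{P}(V_{N,n}>k)=\prod_{i=1}^{k-1}\frac{n(N-i)}{nN-i}$ (the first $k$ draws must all show distinct colours):

```python
import random
from fractions import Fraction
from math import comb

def draw_until_repeat(N, n, rng):
    """Shuffle an urn with n balls of each of N colours and count
    draws until some colour appears twice."""
    urn = [c for c in range(N) for _ in range(n)]
    rng.shuffle(urn)
    seen = set()
    for k, colour in enumerate(urn, start=1):
        if colour in seen:
            return k
        seen.add(colour)

def conjectured_pmf(N, n, k):
    """The pmf proposed in step 1."""
    return comb(N, k - 1) * n ** (k - 1) * (k - 1) / comb(N * n, k)

def exact_E(N, n):
    """E[V_{N,n}] via the tail sum E[V] = sum_{k>=0} P(V > k),
    with P(V > k) = prod_{i=1}^{k-1} n(N-i)/(nN-i)."""
    total, tail = Fraction(2), Fraction(1)   # P(V>0) = P(V>1) = 1
    for k in range(2, N + 2):                # V <= N+1 always
        tail *= Fraction(n * (N - k + 1), n * N - k + 1)
        total += tail
    return total

rng = random.Random(0)
N, n, trials = 10, 3, 200_000
freq = {}
for _ in range(trials):
    k = draw_until_repeat(N, n, rng)
    freq[k] = freq.get(k, 0) + 1
for k in sorted(freq):
    print(k, freq[k] / trials, conjectured_pmf(N, n, k))
print("E[V] =", float(exact_E(N, n)))
```

The two columns should drift apart for $k\ge 3$: the count seems to miss a factor of $n-1$ (the repeated ball can be any of the $n-1$ remaining balls of its colour, not just a colour choice) and a factor of $1/k$ (the denominator $\binom{Nn}{k}$ is unordered, while the event singles out which ball comes last). With both corrections, the case $N=n=k=2$ gives $\frac13$, matching direct enumeration.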

Edit 1 (old method): With Raskolnikov's help, I now have, after some manipulation,

$$\mathbb{E}[V_{N,n}]=n(n-1)\sum\limits_{k=2}^N \frac{\binom{N}{k-1}n^{k-2}(k-1)}{\binom{Nn}{k}}+\frac{n^{N-1}(N+1)}{\binom{Nn-1}{N-1}}.$$

This looks a little more interesting – I wanted the sum on the left to look like the derivative with respect to $n$ of a simpler sum, but the denominator is still troubling me. I thought maybe Vandermonde's identity would be useful,

$$\binom{Nn}{k}=\sum\limits_{k_1+\cdots+k_n=k} \binom{N}{k_1}\cdots\binom{N}{k_n},$$

but that also seems to lead to a dead end.
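Before pushing the algebra further, it may be worth comparing the Edit 1 display against an independent exact computation of $\mathbb{E}[V_{N,n}]$ (a small Python sketch; `exact_E` repeats the tail-sum helper from the sketch above, and `edit1_E` just evaluates the display term by term):

```python
from fractions import Fraction
from math import comb

def exact_E(N, n):
    """E[V_{N,n}] via E[V] = sum_{k>=0} P(V > k), as in the sketch above."""
    total, tail = Fraction(2), Fraction(1)
    for k in range(2, N + 2):
        tail *= Fraction(n * (N - k + 1), n * N - k + 1)
        total += tail
    return total

def edit1_E(N, n):
    """The Edit 1 display, evaluated term by term."""
    s = sum(Fraction(comb(N, k - 1) * n ** (k - 2) * (k - 1), comb(N * n, k))
            for k in range(2, N + 1))
    return n * (n - 1) * s + Fraction(n ** (N - 1) * (N + 1), comb(N * n - 1, N - 1))

for N in (3, 5, 8):
    print(N, exact_E(N, 2), edit1_E(N, 2))
```

If the two columns disagree, the problem is upstream (in the step 1 pmf) rather than in the manipulation itself.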

Edit 2 (new method): Okay, with Raskolnikov's new method, I think it suffices to show that the limiting distribution of $V_{N,n}$ is exponential with some rate $r\in(0,\infty)$. With this approach, I tried the following (setting $n=2$ for concreteness):
$$\mathbb{P}(V_{N,2}>k)=\frac{2^{k-1}\binom{N-1}{k-1}}{\binom{2N-1}{k-1}} \\ = 2^k\frac{(2N-k)\cdots(N-k+1)}{2N\cdots(N+1)} \\ = 2^k\left(1-\frac{k}{2N}\right)\cdots\left(1-\frac{k}{N+1}\right) \\ =^*\left(1-\frac{rk}{N}\right)^N\rightarrow e^{-rk}\;\;\text{as}\;\; N\rightarrow\infty$$

but I'm blanking on the choice of $r$ to get $=^*$ to hold (or if it's even possible).
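One way to test whether any constant $r$ can work is to evaluate the exact product numerically with $k$ growing like $\sqrt{N}$ (a minimal Python probe; `survival_exact` is an ad hoc name). If $\mathbb{P}(V_{N,2}>k)$ were asymptotically $e^{-rk}$, the printed probabilities would shrink with $N$ for fixed $x$; the last column is one Gaussian-type candidate, $e^{-x^2/4}$, for comparison:

```python
from math import comb, exp, sqrt

def survival_exact(N, k):
    """P(V_{N,2} > k) = 2^(k-1) C(N-1,k-1) / C(2N-1,k-1), from Edit 2."""
    return 2 ** (k - 1) * comb(N - 1, k - 1) / comb(2 * N - 1, k - 1)

x = 1.0
for N in (10**2, 10**4, 10**6):
    k = round(x * sqrt(N))          # k grows like x * sqrt(N)
    print(N, k, survival_exact(N, k), exp(-x * x / 4))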

Question: I would appreciate some help with evaluating the limit in Edit 2. Thanks!

Best Answer

This is almost like the birthday paradox, but without replacement. So it is natural to expect that $M_{N,n}\propto \sqrt{N}$. For any $x>0$, the event $\{V_{N,n}>x\sqrt{N}\}$ means that the first $x\sqrt{N}$ draws all show distinct colours (assume for notational ease that $x\sqrt{N}$ is an integer), so

$$ \mathbb{P}(V_{N,n}>x\sqrt{N}) = \frac{nN\cdot n(N-1) \cdots n(N-x\sqrt{N}+1)}{nN(nN-1)\cdots (nN-x\sqrt{N}+1)} \\ = n^{x\sqrt{N}}\frac{N!\,(nN - x\sqrt{N})!}{(N-x\sqrt{N})!\,(nN)!}\\ \sim n^{x\sqrt{N}} \frac{\sqrt{2\pi N} \cdot \left(\frac{N}{e}\right)^{N}\cdot \sqrt{2\pi (nN-x\sqrt{N})} \cdot \left(\frac{nN-x\sqrt{N}}{e}\right)^{nN-x\sqrt{N}}}{\sqrt{2\pi (N-x\sqrt{N})} \cdot \left(\frac{N-x\sqrt{N}}{e}\right)^{N-x\sqrt{N}}\sqrt{2\pi n N} \cdot \left(\frac{nN}{e}\right)^{nN}}\\ \sim \frac{\left(1-\frac{x}{n\sqrt{N}}\right)^{nN-x\sqrt{N}}}{\left(1-\frac{x}{\sqrt{N}}\right)^{N-x\sqrt{N}}}, \quad N\to\infty. $$

Now

$$ \log \frac{\left(1-\frac{x}{n\sqrt{N}}\right)^{nN-x\sqrt{N}}}{\left(1-\frac{x}{\sqrt{N}}\right)^{N-x\sqrt{N}}}\\ = (nN-x\sqrt{N})\cdot \log \left(1-\frac{x}{n\sqrt{N}}\right) - (N-x\sqrt{N})\cdot \log \left(1-\frac{x}{\sqrt{N}}\right)\\ = -(nN-x\sqrt{N})\cdot \left(\frac{x}{n\sqrt{N}} + \frac{x^2}{2n^2N} + o(N^{-1})\right)\\ + (N-x\sqrt{N})\cdot \left(\frac{x}{\sqrt{N}}+\frac{x^2}{2N} + o(N^{-1})\right)\\ = -\frac{x^2}{2}\cdot \left(1-\frac1n\right) + o(1),\quad N\to\infty. $$

As a result,

$$ \mathbb{P}(V_{N,n}>x\sqrt{N}) \to \exp\left\{-\frac{x^2}{2}\cdot \left(1-\frac1n\right) \right\},\quad N\to\infty. $$

Consequently, $V_{N,n}/\sqrt{N}$ converges weakly to the Rayleigh distribution with scale parameter $\sigma_n=(1-n^{-1})^{-1/2}$.
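To see the convergence concretely, here is a minimal numeric check (Python; `log_survival` is an ad hoc helper) of the exact survival product against the claimed limit $\exp\{-\frac{x^2}{2}(1-\frac1n)\}$:

```python
from math import exp, log, sqrt

def log_survival(N, n, k):
    """log P(V_{N,n} > k), using P(V > k) = prod_{i=1}^{k-1} n(N-i)/(nN-i)."""
    return sum(log(n * (N - i)) - log(n * N - i) for i in range(1, k))

n, x = 3, 1.5
for N in (10**3, 10**5, 10**7):
    k = round(x * sqrt(N))
    print(N, exp(log_survival(N, n, k)), exp(-x * x * (1 - 1 / n) / 2))
```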

I believe that with a little bit of extra work (basically through $M_{N,n} = \int_{0}^\infty \mathbb{P}(V_{N,n}>t)\,dt$, plus a domination argument to pass to the limit under the integral; I'm too lazy to do it) one can show that $M_{N,n}\sim \sigma_n\sqrt{\pi N/2}$, $N\to\infty$. As a result, $V_{N,n}/M_{N,n}$ converges in distribution to the Rayleigh distribution with scale parameter $\sqrt{2/\pi}$ (which has mean $1$).
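The asymptotics of $M_{N,n}$ can also be checked numerically (a sketch; `mean_V` sums the exact tail probabilities in floating point):

```python
from math import pi, sqrt

def mean_V(N, n):
    """M_{N,n} = E[V_{N,n}] = sum_{k>=0} P(V > k), via the exact product."""
    total, tail = 2.0, 1.0        # P(V>0) = P(V>1) = 1
    for k in range(2, N + 2):
        tail *= n * (N - k + 1) / (n * N - k + 1)
        total += tail
    return total

n = 2
sigma = (1 - 1 / n) ** -0.5
for N in (10**2, 10**4, 10**6):
    print(N, mean_V(N, n), sigma * sqrt(pi * N / 2))
```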


Here is another, less formal and perhaps more intuitive, derivation (we look at $V_{N,n}/\sqrt{N}$, so we work in this scale). Given that $V_{N,n}>x\sqrt{N}$, about $x\sqrt{N}$ colours have been seen and each unseen colour still has all $n$ of its balls, so the conditional probability that a repetition happens within the next $\Delta x \sqrt{N}$ draws is approximately

$$ \mathbb P(V_{N,n}< (x+\Delta x) \sqrt{N}\mid V_{N,n}>x\sqrt{N})\approx 1 - \Big(\frac{nN-nx\sqrt{N}}{nN - x\sqrt{N}}\Big)^{\Delta x\sqrt{N}} \\ = 1 - \Big(1 - \frac{(n-1) x}{n\sqrt{N} - x}\Big)^{\Delta x\sqrt{N}} \approx \Big(1 - \frac{1}{n}\Big)x \cdot \Delta x. $$

This confirms the intuition that the hazard rate is proportional to the number of balls already drawn. Hence, denoting by $\overline F_n$ the limiting complementary cdf of $V_{N,n}/\sqrt{N}$, we get

$$ \frac{\overline F_n(x) -\overline F_n(x+\Delta x)}{\overline F_n(x)}\approx \Big(1 - \frac{1}{n}\Big)x \cdot \Delta x, $$

so

$$ \overline F_n'(x) = -\Big(1 - \frac{1}{n}\Big)x\cdot \overline F_n(x), $$

whence, taking into account that $\overline F_n(0) = 1$, we get

$$ \overline F_n(x)= \exp\left\{-\frac{x^2}{2}\cdot \left(1-\frac1n\right) \right\}. $$
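The hazard-rate heuristic is also easy to probe with the exact survival product (a sketch; `survival` is an ad hoc name): the finite-$N$ conditional probability of a repetition in the next $\Delta x\sqrt{N}$ draws, divided by $\Delta x$, should sit near $(1-\frac1n)x$:

```python
from math import sqrt

def survival(N, n, k):
    """P(V_{N,n} > k), the exact product prod_{i=1}^{k-1} n(N-i)/(nN-i)."""
    p = 1.0
    for i in range(1, k):
        p *= n * (N - i) / (n * N - i)
    return p

N, n, x, dx = 10**6, 2, 1.0, 0.01
k1, k2 = round(x * sqrt(N)), round((x + dx) * sqrt(N))
hazard = (1 - survival(N, n, k2) / survival(N, n, k1)) / dx
print(hazard, (1 - 1 / n) * x)    # the two numbers should be close
```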