Does the sum of random variables sampled with/without replacement differ for large populations?

Tags: central-limit-theorem, law-of-large-numbers, probability, probability-distributions, statistics

We have a population of $N$ distinct balls, with $N$ even: half of the balls are red and half are blue. We perform $N$ trials.

In trials $i = 1, \dots, N$ we pick a ball $B_i$ uniformly at random. First we consider picking the balls with replacement, and then picking the balls without replacement.

Define the sum $S = \displaystyle\sum_{i=2}^N \sigma_i$, where for each trial $i \in \{2, 3, \dots, N\}$

$$
\sigma_i = \begin{cases}
1 & \text{if } B_i \text{ has the same color as } B_{i-1}, \\
0 & \text{otherwise.}
\end{cases}
$$

If the balls are picked with replacement, the probability mass function (pmf) of $\sigma_i$ is
$$
g(\sigma_i) = \begin{cases}
0.5 & \text{for } \sigma_i = 1, \\
0.5 & \text{for } \sigma_i = 0.
\end{cases}
$$
Accordingly, the pmf of $S$ is the binomial distribution

$$
f(s)= \binom{N-1}{s}\times0.5^{s}\times0.5^{N-1-s}
$$

In the limit $N \to \infty$, the binomial distribution approaches a Gaussian distribution (the de Moivre–Laplace theorem, a special case of the central limit theorem).
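As a quick numerical check of this limit, here is a sketch in Python; the particular values of $N$ and $s$ are arbitrary illustration choices:

```python
# Compare the Binomial(N-1, 0.5) pmf to the Gaussian density with the
# same mean (N-1)/2 and variance (N-1)/4 (de Moivre-Laplace).
import math

N = 1001          # population size, so there are N - 1 adjacent pairs
m = N - 1         # number of Bernoulli terms in S
mean, var = m / 2, m / 4

for s in (450, 475, 500, 525, 550):
    pmf = math.comb(m, s) * 0.5 ** m
    gauss = math.exp(-(s - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    print(f"s={s}: binomial={pmf:.6f}  gaussian={gauss:.6f}")
```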

I have performed a few computer simulations in which the balls are picked without replacement, and the pmf of $S$ converges to the same binomial/Gaussian distribution in the limit $N \to \infty$. This convergence occurs even though the $\sigma_i$ are no longer i.i.d. (independent and identically distributed).
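For reference, a minimal sketch of such a simulation (NumPy assumed; the constants are arbitrary):

```python
# Estimate the distribution of S under both sampling schemes.
import numpy as np

rng = np.random.default_rng(0)

def sample_S(N, replace):
    """Draw N balls (0 = blue, 1 = red) and count same-color adjacent pairs."""
    population = np.repeat([0, 1], N // 2)
    draws = rng.choice(population, size=N, replace=replace)
    return int(np.sum(draws[1:] == draws[:-1]))

N, runs = 1000, 20000
for replace in (True, False):
    S = np.array([sample_S(N, replace) for _ in range(runs)])
    print(f"replace={replace}: mean={S.mean():.2f}, var={S.var():.2f}")
# Both schemes give mean near (N-1)/2 = 499.5 and variance near (N-1)/4,
# and histograms of S approach the same Gaussian shape as N grows.
```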

Why does the sum $S$ converge to the same distribution whether the balls are picked with or without replacement?

I suspect that I need a central limit theorem for weakly correlated variables.

Note: I would like a proof, but all explanations are welcome. I suspect there is a fundamental explanation that is obvious to somebody with more statistics experience than me.

Best Answer

Picking all $N = 2n$ balls without replacement produces a uniformly random arrangement of $n$ red and $n$ blue balls, so the model is the following. Consider a random binary sequence $s_n$ of size $2n$ such that $\sum_{i=1}^{2n} s_{n,i} = n$. Then the number of runs of length $2$ ($11$ or $00$) is
$$
S_{2n} = \sum_{i=1}^{2n-1} (\xi_{n,i} + \zeta_{n,i}),
$$
where $\xi_{n,i} = s_{n,i} s_{n,i+1}$ and $\zeta_{n,i} = (1 - s_{n,i})(1 - s_{n,i+1})$. Now
$$
\mathsf{E}\xi_{n,i} = \mathsf{P}(\xi_{n,i} = 1) = \frac{(2n-2)!}{(n-2)!\,n!} \left(\frac{(2n)!}{n!\,n!}\right)^{-1} = \frac{1}{2}\,\frac{n-1}{2n-1},
$$
$$
\mathsf{E}\xi_{n,i}^2 = \mathsf{P}(\xi_{n,i} = 1),
$$
\begin{align}
\mathsf{E}\xi_{n,i}\xi_{n,j} &= \mathsf{P}(\xi_{n,i} = 1, \xi_{n,j} = 1) = \frac{(2n-4)!}{(n-4)!\,n!} \left(\frac{(2n)!}{n!\,n!}\right)^{-1} \\
&= \frac{1}{4}\,\frac{(n-2)(n-3)}{(2n-1)(2n-3)}
\end{align}
if $|i-j| > 1$, and
$$
\mathsf{E}\xi_{n,i}\xi_{n,j} = \frac{1}{4}\,\frac{n-2}{2n-1}
$$
if $|i-j| = 1$. Finally, $\mathsf{E}\xi_{n,i}\zeta_{n,j} = 0$ if $|i-j| \le 1$, and
$$
\mathsf{E}\xi_{n,i}\zeta_{n,j} = \frac{1}{4}\,\frac{n(n-1)}{(2n-1)(2n-3)}
$$
if $|i-j| > 1$.
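These pair moments can be verified exhaustively for small $n$; a sketch using exact rational arithmetic (the helper names here are mine):

```python
# Brute-force check of the formulas for E[xi], E[xi*xi], E[xi*zeta]
# over all binary sequences of length 2n containing exactly n ones.
from fractions import Fraction
from itertools import combinations

n = 5
seqs = []
for ones in combinations(range(2 * n), n):
    ones = set(ones)
    seqs.append([1 if k in ones else 0 for k in range(2 * n)])

def E(f):
    """Exact expectation of f over the uniform distribution on seqs."""
    return Fraction(sum(f(s) for s in seqs), len(seqs))

xi   = lambda s, i: s[i] * s[i + 1]
zeta = lambda s, i: (1 - s[i]) * (1 - s[i + 1])

print(E(lambda s: xi(s, 0)),            Fraction(n - 1, 2 * (2 * n - 1)))
print(E(lambda s: xi(s, 0) * xi(s, 1)), Fraction(n - 2, 4 * (2 * n - 1)))
print(E(lambda s: xi(s, 0) * xi(s, 2)),
      Fraction((n - 2) * (n - 3), 4 * (2 * n - 1) * (2 * n - 3)))
print(E(lambda s: xi(s, 0) * zeta(s, 2)),
      Fraction(n * (n - 1), 4 * (2 * n - 1) * (2 * n - 3)))
```

Each line prints the enumerated expectation next to the closed-form value; the two agree.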


We can compute the moments of $S_{2n}$:
$$
\mathsf{E}S_{2n} = n - 1,
$$
\begin{align}
\mathsf{E}S_{2n}^2 &= \sum_{i=1}^{2n-1} \mathsf{E}[\xi_{n,i}^2 + \zeta_{n,i}^2] + \sum_{|i-j|=1} \mathsf{E}[\xi_{n,i}\xi_{n,j} + \zeta_{n,i}\zeta_{n,j}] \\
&\quad + \sum_{|i-j|>1} \mathsf{E}[(\xi_{n,i} + \zeta_{n,i})(\xi_{n,j} + \zeta_{n,j})] \\
&= (n-1) + \frac{2(n-1)(n-2)}{2n-1} + \frac{2(n^2-3n+3)(n-1)}{2n-1} \\
&= \frac{(n-1)(2n^2-2n+1)}{2n-1},
\end{align}
and thus
$$
\operatorname{Var}(S_{2n}) = \frac{n(n-1)}{2n-1}.
$$
As $n \to \infty$, $\mathsf{E}S_{2n}/(2n) \to 1/2$ and $\operatorname{Var}(S_{2n}/\sqrt{2n}) \to 1/4$. Also, since $\operatorname{Cov}(\xi_{n,i}, \xi_{n,i+1}) \to 1/16$, and $\operatorname{Cov}(\xi_{n,i}, \xi_{n,j})$ and $\operatorname{Cov}(\xi_{n,i}, \zeta_{n,j})$ converge to $0$ when $|i-j| > 1$, $(S_{2n} - \mathsf{E}S_{2n})/\sqrt{2n}$ behaves like the same quantity computed when the $s_{n,i}$'s are i.i.d. Bernoulli random variables; in particular, the centered and scaled sum has the same Gaussian limit in both cases.
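The mean and variance formulas also check out under the same exhaustive enumeration (a sketch, feasible for small $n$ only):

```python
# Check E[S_2n] = n - 1 and Var(S_2n) = n(n-1)/(2n-1) exactly.
from fractions import Fraction
from itertools import combinations

for n in (3, 4, 5, 6):
    first = second = count = 0
    for ones in combinations(range(2 * n), n):
        ones = set(ones)
        s = [1 if k in ones else 0 for k in range(2 * n)]
        S = sum(s[i] == s[i + 1] for i in range(2 * n - 1))
        first, second, count = first + S, second + S * S, count + 1
    mean = Fraction(first, count)
    var = Fraction(second, count) - mean ** 2
    print(n, mean == n - 1, var == Fraction(n * (n - 1), 2 * n - 1))
```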
