Variance of number of distinct values in a collection of iid discrete random variables

concentration-of-measure, probability, stochastic-processes, variance

I'm struggling with the following problem. Let $X_i$ be iid discrete random variables taking values in $\mathbb{N}$, with arbitrary distribution $\mathrm{P}(X_i = k) = p_k$. Let $Z_n$ be the number of distinct values among $X_1, \ldots, X_n$. I need to find its expectation and variance, and compare the latter with the Efron-Stein bound on the variance.

My reasoning is as follows: let $Y_i$ be the indicator random variable equal to $1$ if $X_i$ takes a value different from all of $X_1, \ldots, X_{i - 1}$. Then

$$
\mathrm{P}(Y_i = 1) = \sum_{k = 1}^{\infty} \mathrm{P}(Y_i = 1 \mid X_i = k)\, \mathrm{P}(X_i = k) = \sum_{k = 1}^{\infty} \mathrm{P}(X_1 \neq k, \ldots, X_{i - 1} \neq k)\, \mathrm{P}(X_i = k) = \sum_{k = 1}^{\infty} (1 - p_k)^{i - 1} p_k.
$$

Then, since $Z_n = \sum_{i = 1}^n Y_i$:

$$
\mathrm{E} Z_n = \sum_{i = 1}^n \mathrm{E} Y_i = \sum_{i = 1}^n \sum_{k = 1}^{\infty} (1 - p_k)^{i - 1} p_k.
$$

So far so good. I checked my calculations against simulations from a Poisson distribution, and they agree very closely.
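For reference, here is a minimal Python sketch of that simulation check (NumPy and SciPy assumed; the rate $\lambda$, the horizon $n$, the truncation point `kmax`, and the number of replications are arbitrary illustration choices):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)

def expected_distinct(p, n):
    # E Z_n = sum_{i=1}^n sum_k (1 - p_k)^(i-1) p_k, for a truncated pmf vector p
    return sum(((1.0 - p) ** (i - 1) * p).sum() for i in range(1, n + 1))

lam, n, kmax, reps = 3.0, 50, 80, 20_000
p = poisson.pmf(np.arange(kmax), lam)        # truncated pmf; the tail beyond kmax is negligible

theory = expected_distinct(p, n)
samples = rng.poisson(lam, size=(reps, n))
empirical = np.mean([len(set(row)) for row in samples])
print(theory, empirical)                     # the two values should agree closely
```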

Now for the variance. Changing any single $X_i$ changes $Z_n$ by at most one, so $Z_n$ is a function with bounded differences, and the Efron-Stein inequality gives an estimate on the variance:

$$
\mathrm{Var} (Z_n) \leq \frac{n}{4}.
$$
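For a sense of how loose this bound is, a quick simulation sketch (same hypothetical Poisson setup as above):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 3.0, 50, 20_000
samples = rng.poisson(lam, size=(reps, n))
z = np.array([len(set(row)) for row in samples])
print(z.var(), n / 4)                        # empirical Var(Z_n) vs the n/4 bound
```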

But I can't compute the variance analytically to compare it with this bound. Unfortunately, $Y_n$ and $Y_m$ are anti-correlated, so simply summing the individual variances won't do. I tried to compute the covariances, but ended up with a rather messy formula. Any ideas on how I can get a nice expression for $\mathrm{Var}(Z_n)$?

Best Answer

You can rewrite $Z_n$ as
$$
Z_n = \sum_{i=1}^\infty I(\{X_1=i\}\cup\ldots\cup\{X_n=i\}) = \sum_{i=1}^\infty\bigl(1-I(X_1\neq i,\ldots,X_n\neq i)\bigr),
$$
so
$$
\mathbb E Z_n = \sum_{i=1}^\infty\bigl(1-\mathbb P(X_1\neq i,\ldots,X_n\neq i)\bigr) = \sum_{i=1}^\infty\bigl(1-(1-p_i)^n\bigr).
$$

In the paper N. S. Zakrevskaya, A. P. Kovalevskii, "One-parameter probabilistic models of text statistics", Sib. Zh. Ind. Mat., 4:2 (2001), 142–153 (in Russian only), the following inequality is proved (p. 146, Lemma 3):
$$
\mathrm{Var}(Z_n)\leq \mathbb E Z_n.
$$
The proof is very simple:
$$
\mathbb E(Z_n^2) = \sum_{i=1}^\infty\sum_{j=1}^\infty\bigl(1-\mathbb P(X_1\neq i,\ldots, X_n\neq i)-\mathbb P(X_1\neq j,\ldots, X_n\neq j)+\mathbb P(X_1\neq i,\ldots, X_n\neq i,\, X_1\neq j,\ldots, X_n\neq j)\bigr)
$$
$$
\tag{1}\label{1}
=\sum_{i=1}^\infty\sum_{\substack{j=1\\ j\neq i}}^\infty\bigl(1-(1-p_i)^n-(1-p_j)^n+(1-p_i-p_j)^n\bigr)+\mathbb E Z_n
$$
$$
\leq \sum_{i=1}^\infty\sum_{\color{red}{j=1}}^{\color{red}\infty}\bigl(1-(1-p_i)^n-(1-p_j)^n+(1-p_i-p_j\color{red}{+p_ip_j})^n\bigr)+\mathbb E Z_n
$$
$$
=\bigl(\mathbb E Z_n\bigr)^2+\mathbb E Z_n.
$$
Here the diagonal terms $i=j$ contribute the $+\mathbb E Z_n$, and the inequality step uses $1-p_i-p_j\leq(1-p_i)(1-p_j)$ together with the fact that the diagonal terms added back equal $\bigl(1-(1-p_i)^n\bigr)^2\geq 0$. So $\mathrm{Var}(Z_n)\leq \mathbb E Z_n$. Two pages later, in Lemma 6, it is proved that $\mathbb E Z_n = o(n)$ as $n\to\infty$.
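For concreteness, here is a small sketch (NumPy assumed; the geometric law $p_i = 2^{-i}$, the truncation at $i = 59$, and the sample count are illustration choices) that evaluates the closed form for $\mathbb E Z_n$ and checks the bound $\mathrm{Var}(Z_n)\leq\mathbb E Z_n$ by simulation; conveniently, $p_i = 2^{-i}$ is exactly NumPy's geometric law with success probability $1/2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
p = 2.0 ** (-np.arange(1, 60))               # p_i = 2^{-i}, support truncated at i = 59
ez = np.sum(1 - (1 - p) ** n)                # closed form for E Z_n

samples = rng.geometric(0.5, size=(20_000, n))   # P(X = i) = 2^{-i}, i = 1, 2, ...
z = np.array([len(set(row)) for row in samples])
print(z.mean(), ez)                          # empirical vs exact expectation
print(z.var(), ez)                           # Lemma 3: the first should not exceed the second
```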

You can use the expression for the second moment to try to find the variance, but I doubt a nice closed form is possible. Subtracting the squared expectation from \eqref{1} we get
$$
\mathrm{Var}(Z_n)=\sum_{i=1}^\infty\sum_{\substack{j=1\\ j\neq i}}^\infty\bigl((1-p_i-p_j)^n-(1-p_i)^n(1-p_j)^n\bigr)+\sum_{i=1}^\infty (1-p_i)^n\bigl(1-(1-p_i)^n\bigr).
$$
The first summand is negative, and the second one is very small in examples. Say, under the geometric distribution $p_i=2^{-i}$, for $n=20$ the second summand equals $0.98495$; for $p_i=\frac12\bigl(\frac23\bigr)^i$ and $n=10$ it equals $1.64842$.
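To reproduce these numbers, a sketch under the same assumptions (NumPy, support truncated at $i = 59$) that evaluates both summands for the geometric example:

```python
import numpy as np

n = 20
p = 2.0 ** (-np.arange(1, 60))               # p_i = 2^{-i}, support truncated at i = 59
q = (1 - p) ** n                             # (1 - p_i)^n

second = np.sum(q * (1 - q))                 # diagonal summand

P2 = p[:, None] + p[None, :]                 # matrix of p_i + p_j
M = (1 - P2) ** n - q[:, None] * q[None, :]
np.fill_diagonal(M, 0.0)                     # restrict to i != j
first = M.sum()                              # off-diagonal summand (negative)

print(first, second, first + second)         # first + second = Var(Z_n)
```

The printed diagonal part should match the $0.98495$ quoted above, and the negative off-diagonal part quantifies the anti-correlation between the indicators.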
