Formula for conditional expectation in a binomial sample

binomial-distribution, parameter-estimation, probability-distributions, statistical-inference, statistics

Let's take $N$ i.i.d. random variables $X_i$, where $X_i \sim \mathrm{Bin}(n,p)$.

Taking inspiration from here, we should have the following facts:

  1. $\operatorname{Var}(X_i)=np(1-p)$.
  2. The sample mean $M(X_1,\dots,X_N)=\frac{\sum_i X_i}{N}$ is a complete sufficient statistic.
  3. The maximum-likelihood estimator (MLE) of $np(1-p)$ is $T_{MLE}=nM(1-M)$.

Combining these points, we have that $T_{MLE}$ is also the UMVUE by the Lehmann–Scheffé theorem.

Now we also have the following fact:

  1. The (corrected) sample variance $S^2=\frac{1}{N-1}\sum_i{(X_i-M)^2}$ is an unbiased estimator of $\operatorname{Var}(X_i)$.

From Lehmann–Scheffé, and since the UMVUE is unique, we should then have:

$$E[S^2\mid M]=nM(1-M)$$

My questions:

  • Is my reasoning correct, or am I applying some theorem in the wrong way?

  • If the reasoning is correct, what would be a direct derivation of the final result? Is the formula trivial for some reason I do not see?

Best Answer

Your reasoning is correct, except that the MLE is not the UMVUE of the population variance.

A complete sufficient statistic for $p$ is $T=\sum\limits_{i=1}^N X_i$, which has a $\mathsf{Bin}(nN,p)$ distribution.

Now $E_p[T]=nNp$ and $\operatorname{Var}_p[T]=nNp(1-p)$ for all $p\in(0,1)$.

Again, $$E_p[T^2]=\operatorname{Var}_p[T]+(E_p[T])^2=nNp(1-p)+n^2N^2p^2$$

Or, $$E_p[T^2-T]=nNp^2(nN-1)$$

That is, $$E_p\left[\frac{T(T-1)}{N(nN-1)}\right]=np^2$$
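As a sanity check on this algebra (my addition, not part of the original answer), one can have SymPy redo the simplification symbolically:

```python
import sympy as sp

n, N, p = sp.symbols('n N p', positive=True)

# Moments of T ~ Bin(nN, p), as derived above
ET = n * N * p
ET2 = n * N * p * (1 - p) + n**2 * N**2 * p**2

# E[T^2 - T] / (N(nN - 1)) should reduce to n*p**2
print(sp.simplify((ET2 - ET) / (N * (n * N - 1))))
```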

So you have an unbiased estimator of the population variance based on $T$ (and hence the UMVUE):

$$E_p\left[\frac TN-\frac{T(T-1)}{N(nN-1)}\right]=np-np^2=np(1-p)\quad,\forall\,p\in(0,1)$$
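To illustrate the unbiasedness numerically, here is a minimal Monte Carlo sketch (my own; the values $n=5$, $N=10$, $p=0.3$ are arbitrary choices):

```python
import numpy as np

n, N, p = 5, 10, 0.3  # arbitrary illustrative parameters
rng = np.random.default_rng(0)

# 200000 replications; each row is one sample X_1, ..., X_N with X_i ~ Bin(n, p)
X = rng.binomial(n, p, size=(200_000, N))
T = X.sum(axis=1)  # the complete sufficient statistic

# Unbiased estimator of np(1-p) based on T
est = T / N - T * (T - 1) / (N * (n * N - 1))

print(est.mean())       # Monte Carlo average of the estimator
print(n * p * (1 - p))  # true np(1-p) = 1.05
```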

With $\overline X=\frac TN$, the sample variance $S^2=\frac1{N-1}\sum\limits_{i=1}^N (X_i-\overline X)^2$ is unbiased for the population variance. So by Lehmann–Scheffé, $E\left[S^2\mid T\right]$ is also the UMVUE of $np(1-p)$.

As the UMVUE is unique whenever it exists, you can say

$$E\left[S^2\mid T\right]=\frac TN-\frac{T(T-1)}{N(nN-1)}\tag{*}$$

This can be rewritten in terms of $\overline X$ of course.
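One can also check $(*)$ itself by simulation (again my own sketch, with arbitrary parameters), conditioning empirically by restricting to the replications where $T$ attains a chosen value:

```python
import numpy as np

n, N, p = 4, 6, 0.4  # arbitrary illustrative parameters
rng = np.random.default_rng(1)

X = rng.binomial(n, p, size=(500_000, N))
T = X.sum(axis=1)
S2 = X.var(axis=1, ddof=1)  # corrected sample variance of each row

t = 10  # a value near E[T] = nNp = 9.6, so it occurs often
print(S2[T == t].mean())                        # empirical E[S^2 | T = t]
print(t / N - t * (t - 1) / (N * (n * N - 1)))  # formula (*)
```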


A direct way to obtain $(*)$ would be to proceed using linearity of expectation.

I think it should be something like

\begin{align}
E\left[S^2\mid T=t\right]&=E\left[\frac{1}{N-1}\sum_{i=1}^N\left(X_i-\frac tN\right)^2\,\middle|\, T=t\right] \\
&=E\left[\frac{1}{N-1}\left(\sum_{i=1}^N X_i^2-\frac{t^2}{N}\right)\,\middle|\, T=t\right] \\
&=\frac{N}{N-1}\,E\left[X_1^2\mid T=t\right]-\frac{t^2}{N(N-1)},
\end{align}

where the last step uses that, by symmetry, $E[X_i^2\mid T=t]$ is the same for all $i$.

Now we only have to recall that $X_1$ conditioned on $T=t$ has a hypergeometric distribution: given $T=t$, $X_1$ counts how many of the $t$ total successes fall among the first $n$ of the $nN$ underlying Bernoulli trials.
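For completeness, here is a sketch of that final step (my addition; it uses only the standard hypergeometric mean and variance). Given $T=t$, $X_1\sim\mathrm{Hypergeometric}(nN,\,n,\,t)$, so

$$E[X_1\mid T=t]=\frac{tn}{nN}=\frac tN,\qquad \operatorname{Var}[X_1\mid T=t]=t\cdot\frac1N\left(1-\frac1N\right)\cdot\frac{nN-t}{nN-1}.$$

Hence

$$E[X_1^2\mid T=t]=\frac{t(N-1)(nN-t)}{N^2(nN-1)}+\frac{t^2}{N^2},$$

and substituting into the display above,

$$E\left[S^2\mid T=t\right]=\frac{N}{N-1}\,E\left[X_1^2\mid T=t\right]-\frac{t^2}{N(N-1)}=\frac{t(nN-t)}{N(nN-1)}=\frac tN-\frac{t(t-1)}{N(nN-1)},$$

which is exactly $(*)$.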