[Math] Unbiased estimator of the variance with known population size

parameter-estimation, probability, probability-theory, statistics, variance

The variance is defined as

$$\sigma^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n}$$

where $\bar x = \frac{\sum_{i=1}^n x_i}{n}$.

To estimate this parameter from a sample, one must instead use

$$s^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}$$

because the variance computed with the $\sigma^2$ formula above systematically underestimates the population variance when applied to a sample (the bias shrinks as the sample grows).
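As a quick sanity check of this claim, here is a small NumPy simulation sketch; the standard-normal distribution (true $\sigma^2 = 1$) and the sizes below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. samples from a distribution with known variance (sigma^2 = 1)
n, reps = 5, 200_000
samples = rng.normal(size=(reps, n))

print(samples.var(axis=1, ddof=0).mean())  # divisor n:   ~ (n-1)/n = 0.8
print(samples.var(axis=1, ddof=1).mean())  # divisor n-1: ~ 1.0 (unbiased)
```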

$s^2$ is an unbiased estimator of $\sigma^2$ only if sampling is done with replacement (which is not the case in the model of interest here) or if the population is infinite. Let's call $N$ the size of the population ($n$ being the size of the sample). At the extreme, if $n=N$ (so that every individual is sampled), then $s^2$ is definitely a biased estimator of the variance in the population.
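The bias in the finite-population, without-replacement setting can also be seen numerically. A minimal simulation sketch, assuming NumPy; the population values and the sizes $N$ and $n$ are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 10, 4                      # illustrative population and sample sizes
x = rng.normal(size=N)            # a fixed finite population
sigma2 = x.var()                  # population variance (divisor N)

reps = 200_000
s2 = np.empty(reps)
for r in range(reps):
    y = rng.choice(x, size=n, replace=False)   # sample without replacement
    s2[r] = y.var(ddof=1)                      # the usual s^2 (divisor n-1)

print(f"population sigma^2: {sigma2:.4f}")
print(f"mean of s^2       : {s2.mean():.4f}")  # noticeably larger than sigma^2
```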

What is an unbiased estimator of the variance of the population from a sample knowing the population size $N$?

Best Answer

Let's assume you have a population of size $N$ with values $x_1,\ldots,x_N$, mean $\bar x=\frac{1}{N}\sum_{i=1}^N x_i$ and variance $\sigma^2=\frac{1}{N}\sum_{i=1}^N(x_i-\bar x)^2$. (Note that I use lower case $x_i$ to indicate these are not random, but fixed values.)

Now, let's take a random sample $Y_1,\ldots,Y_n$ of $n$ elements (without replacement), with all such subsets equally likely. (Now I use capital $Y$ to indicate these are random.)

Now, $\bar Y=\frac{1}{n}\sum_{i=1}^n Y_i$ and let $V=\sum_{i=1}^n (Y_i-\bar Y)^2$ so that the sample variance would be $V/n$ (like the expression for $\sigma^2$). If we write $V$ out in terms of $(Y_i-\bar x)^2$ and $(Y_i-\bar x)(Y_j-\bar x)$, we get $$ \begin{split} V =& \sum_{i=1}^n (Y_i-\bar Y)^2 = \sum_{i=1}^n \left[(Y_i-\bar x)-(\bar Y-\bar x)\right]^2 \\ =& \sum_{i=1}^n \left[(Y_i-\bar x)^2-2(Y_i-\bar x)(\bar Y-\bar x)+(\bar Y-\bar x)^2 \right] \\ =& \sum_{i=1}^n (Y_i-\bar x)^2 - n(\bar Y-\bar x)^2 \\ =& \left(1-\frac{1}{n} \right) \sum_{i=1}^n (Y_i-\bar x)^2 -\frac{2}{n}\sum_{1\le i<j\le n} (Y_i-\bar x)(Y_j-\bar x) \end{split} $$ where in the last step we use that $$ \left(\sum_{i=1}^n (Y_i-\bar x)\right)^2 = \sum_{i=1}^n (Y_i-\bar x)^2 + 2\sum_{1\le i<j\le n} (Y_i-\bar x)(Y_j-\bar x). $$
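One can verify this algebraic decomposition numerically for a single draw. A minimal sketch, assuming NumPy (the population values and sizes are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)

N, n = 8, 5                          # arbitrary illustration values
x = rng.normal(size=N)
xbar = x.mean()                      # population mean  \bar x
y = rng.choice(x, size=n, replace=False)

V = ((y - y.mean()) ** 2).sum()      # V = sum (Y_i - Ybar)^2

# Right-hand side of the decomposition
d = y - xbar                                  # Y_i - \bar x
cross = (d.sum() ** 2 - (d ** 2).sum()) / 2   # sum_{i<j} (Y_i-xbar)(Y_j-xbar)
rhs = (1 - 1 / n) * (d ** 2).sum() - (2 / n) * cross

print(V, rhs)  # should agree up to floating-point error
```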

We know that $\text{E}[(Y_i-\bar x)^2]=\sigma^2$: this is just the average of $(x_k-\bar x)^2$ over the population values $x_1,\ldots,x_N$, since $Y_i$ is equally likely to be each of them.

For $i<j$, we can compute $\text{E}[(Y_i-\bar x)(Y_j-\bar x)]$ by noting that it equals the average of $(x_k-\bar x)(x_l-\bar x)$ over all pairs $1\le k<l\le N$, since every pair of distinct population elements is equally likely to be drawn. Since $\sum_{i=1}^N (x_i-\bar x)=0$, we get $$ 0 = \sum_{1\le i,j\le N} (x_i-\bar x)(x_j-\bar x) = \sum_{i=1}^N (x_i-\bar x)^2 + 2\sum_{1\le i<j\le N} (x_i-\bar x)(x_j-\bar x), $$ so $\sum_{1\le i<j\le N} (x_i-\bar x)(x_j-\bar x) = -N\sigma^2/2$, and dividing by the $N(N-1)/2$ pairs gives $$ \text{E}\left[(Y_i-\bar x)(Y_j-\bar x)\right] = -\frac{\sigma^2}{N-1}. $$ Combining these results (there are $n$ squared terms and $n(n-1)/2$ cross terms), we get $$ \text{E}[V] = (n-1)\sigma^2 + \frac{n-1}{N-1}\sigma^2 = \frac{(n-1)N}{N-1}\sigma^2, $$ giving an unbiased estimator $$ \hat\sigma^2 = \frac{N-1}{N(n-1)}V = \frac{N-1}{N(n-1)} \sum_{i=1}^n (Y_i-\bar Y)^2. $$
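A simulation makes it easy to check both $\text{E}[V]$ and the unbiasedness of $\hat\sigma^2$. A minimal sketch, assuming NumPy; the population and the sizes are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)

N, n = 12, 5                    # illustrative sizes
x = rng.normal(size=N)          # a fixed finite population
sigma2 = x.var()                # divisor-N population variance

reps = 200_000
V = np.empty(reps)
for r in range(reps):
    y = rng.choice(x, size=n, replace=False)
    V[r] = ((y - y.mean()) ** 2).sum()

print(f"E[V] (simulated)     : {V.mean():.4f}")
print(f"(n-1)N/(N-1) sigma^2 : {(n - 1) * N / (N - 1) * sigma2:.4f}")

sigma2_hat = (N - 1) / (N * (n - 1)) * V
print(f"mean of sigma^2-hat  : {sigma2_hat.mean():.4f}")
print(f"population sigma^2   : {sigma2:.4f}")
```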

As $N\rightarrow\infty$ you recover the familiar $s^2$ estimator, which corresponds to independent sampling from a distribution, while $n=N$ gives exactly $\sigma^2$, as it should when the $x_i$ are known for the whole population.
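Both limiting cases can be confirmed numerically (again a NumPy sketch with arbitrary values):

```python
import numpy as np

# As N grows, the correction factor (N-1)/(N(n-1)) approaches 1/(n-1)
n = 5
for N in (10, 100, 10_000):
    print(N, (N - 1) / (N * (n - 1)), "vs", 1 / (n - 1))

# With n = N (every individual sampled), the estimator reduces to V/N = sigma^2
rng = np.random.default_rng(3)
x = rng.normal(size=20)
V = ((x - x.mean()) ** 2).sum()
print(V / len(x), x.var())  # identical: the population variance is recovered exactly
```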
