[Math] Unbiased estimator of the variance with known population size

parameter-estimation, probability, probability-theory, statistics, variance

The variance is defined as

$$\sigma^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n}$$

where $\bar x = \frac{\sum_{i=1}^n x_i}{n}$.

To estimate this parameter from a sample, one must instead use

$$s^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}$$

because the variance computed with the $\sigma^2$ formula above systematically underestimates the population variance when applied to a sample (the bias shrinks as the sample grows).
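As a quick sanity check of this claim, here is a small NumPy simulation sketch; the standard-normal distribution (true $\sigma^2 = 1$) and the sizes below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. samples from a distribution with known variance (sigma^2 = 1)
n, reps = 5, 200_000
samples = rng.normal(size=(reps, n))

print(samples.var(axis=1, ddof=0).mean())  # divisor n:   ~ (n-1)/n = 0.8
print(samples.var(axis=1, ddof=1).mean())  # divisor n-1: ~ 1.0 (unbiased)
```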

$s^2$ is an unbiased estimator of $\sigma^2$ only if sampling is done with replacement (which is not the case in the model of interest here) or if the population is infinite. Let's call $N$ the size of the population ($n$ being the size of the sample). At the extreme, if $n=N$ (so that every individual is sampled), then $s^2$ is definitely a biased estimator of the variance in the population.
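The bias in the finite-population, without-replacement setting can also be seen numerically. A minimal simulation sketch, assuming NumPy; the population values and the sizes $N$ and $n$ are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 10, 4                      # illustrative population and sample sizes
x = rng.normal(size=N)            # a fixed finite population
sigma2 = x.var()                  # population variance (divisor N)

reps = 200_000
s2 = np.empty(reps)
for r in range(reps):
    y = rng.choice(x, size=n, replace=False)   # sample without replacement
    s2[r] = y.var(ddof=1)                      # the usual s^2 (divisor n-1)

print(f"population sigma^2: {sigma2:.4f}")
print(f"mean of s^2       : {s2.mean():.4f}")  # noticeably larger than sigma^2
```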

What is an unbiased estimator of the variance of the population from a sample knowing the population size $N$?

Best Answer

Let's assume you have a population of size $N$ with values $x_1,\ldots,x_N$, mean $\bar x=\frac{1}{N}\sum_{i=1}^N x_i$ and variance $\sigma^2=\frac{1}{N}\sum_{i=1}^N(x_i-\bar x)^2$. (Note that I use lower case $x_i$ to indicate these are not random, but fixed values.)

Now, let's take a random sample $Y_1,\ldots,Y_n$ of $n$ elements (without replacement), with all such subsets equally likely. (Now I use capital $Y$ to indicate these are random.)

Now, $\bar Y=\frac{1}{n}\sum_{i=1}^n Y_i$ and let $V=\sum_{i=1}^n (Y_i-\bar Y)^2$ so that the sample variance would be $V/n$ (like the expression for $\sigma^2$). If we write $V$ out in terms of $(Y_i-\bar x)^2$ and $(Y_i-\bar x)(Y_j-\bar x)$, we get $$ \begin{split} V =& \sum_{i=1}^n (Y_i-\bar Y)^2 = \sum_{i=1}^n \left[(Y_i-\bar x)-(\bar Y-\bar x)\right]^2 \\ =& \sum_{i=1}^n \left[(Y_i-\bar x)^2-2(Y_i-\bar x)(\bar Y-\bar x)+(\bar Y-\bar x)^2 \right] \\ =& \sum_{i=1}^n (Y_i-\bar x)^2 - n(\bar Y-\bar x)^2 \\ =& \left(1-\frac{1}{n} \right) \sum_{i=1}^n (Y_i-\bar x)^2 -\frac{2}{n}\sum_{1\le i<j\le n} (Y_i-\bar x)(Y_j-\bar x) \end{split} $$ where in the last step we use that $$ \left(\sum_{i=1}^n (Y_i-\bar x)\right)^2 = \sum_{i=1}^n (Y_i-\bar x)^2 + 2\sum_{1\le i<j\le n} (Y_i-\bar x)(Y_j-\bar x). $$
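One can verify this algebraic decomposition numerically for a single draw. A minimal sketch, assuming NumPy (the population values and sizes are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)

N, n = 8, 5                          # arbitrary illustration values
x = rng.normal(size=N)
xbar = x.mean()                      # population mean  \bar x
y = rng.choice(x, size=n, replace=False)

V = ((y - y.mean()) ** 2).sum()      # V = sum (Y_i - Ybar)^2

# Right-hand side of the decomposition
d = y - xbar                                  # Y_i - \bar x
cross = (d.sum() ** 2 - (d ** 2).sum()) / 2   # sum_{i<j} (Y_i-xbar)(Y_j-xbar)
rhs = (1 - 1 / n) * (d ** 2).sum() - (2 / n) * cross

print(V, rhs)  # should agree up to floating-point error
```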

We know that $\text{E}[(Y_i-\bar x)^2]=\sigma^2$: this is just the average of $(x_k-\bar x)^2$ over the population values $x_1,\ldots,x_N$, since $Y_i$ is equally likely to be each of them.

For $i<j$, we can compute $\text{E}[(Y_i-\bar x)(Y_j-\bar x)]$ by noting that it equals the average of $(x_k-\bar x)(x_l-\bar x)$ over all pairs $1\le k<l\le N$, since every pair of distinct population elements is equally likely to be drawn. Since $\sum_{i=1}^N (x_i-\bar x)=0$, we get $$ 0 = \sum_{1\le i,j\le N} (x_i-\bar x)(x_j-\bar x) = \sum_{i=1}^N (x_i-\bar x)^2 + 2\sum_{1\le i<j\le N} (x_i-\bar x)(x_j-\bar x), $$ so $\sum_{1\le i<j\le N} (x_i-\bar x)(x_j-\bar x) = -N\sigma^2/2$, and dividing by the $N(N-1)/2$ pairs gives $$ \text{E}\left[(Y_i-\bar x)(Y_j-\bar x)\right] = -\frac{\sigma^2}{N-1}. $$ Combining these results (there are $n$ squared terms and $n(n-1)/2$ cross terms), we get $$ \text{E}[V] = (n-1)\sigma^2 + \frac{n-1}{N-1}\sigma^2 = \frac{(n-1)N}{N-1}\sigma^2, $$ giving an unbiased estimator $$ \hat\sigma^2 = \frac{N-1}{N(n-1)}V = \frac{N-1}{N(n-1)} \sum_{i=1}^n (Y_i-\bar Y)^2. $$
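A simulation makes it easy to check both $\text{E}[V]$ and the unbiasedness of $\hat\sigma^2$. A minimal sketch, assuming NumPy; the population and the sizes are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)

N, n = 12, 5                    # illustrative sizes
x = rng.normal(size=N)          # a fixed finite population
sigma2 = x.var()                # divisor-N population variance

reps = 200_000
V = np.empty(reps)
for r in range(reps):
    y = rng.choice(x, size=n, replace=False)
    V[r] = ((y - y.mean()) ** 2).sum()

print(f"E[V] (simulated)     : {V.mean():.4f}")
print(f"(n-1)N/(N-1) sigma^2 : {(n - 1) * N / (N - 1) * sigma2:.4f}")

sigma2_hat = (N - 1) / (N * (n - 1)) * V
print(f"mean of sigma^2-hat  : {sigma2_hat.mean():.4f}")
print(f"population sigma^2   : {sigma2:.4f}")
```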

As $N\rightarrow\infty$ you recover the familiar $s^2$ estimator, which corresponds to independent sampling from a distribution, while $n=N$ gives exactly $\sigma^2$, as it should when the $x_i$ are known for the whole population.
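Both limiting cases can be confirmed numerically (again a NumPy sketch with arbitrary values):

```python
import numpy as np

# As N grows, the correction factor (N-1)/(N(n-1)) approaches 1/(n-1)
n = 5
for N in (10, 100, 10_000):
    print(N, (N - 1) / (N * (n - 1)), "vs", 1 / (n - 1))

# With n = N (every individual sampled), the estimator reduces to V/N = sigma^2
rng = np.random.default_rng(3)
x = rng.normal(size=20)
V = ((x - x.mean()) ** 2).sum()
print(V / len(x), x.var())  # identical: the population variance is recovered exactly
```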
