Unbiased estimator of population variance for sampling without replacement

Tags: inference, survey, survey-sampling, unbiased-estimator

What I write below applies only to the situation where we have a finite population.

I have seen many of my friends use the sample variance with Bessel's correction, $\frac{\sum_i^n (X_i - \bar{X})^2}{n-1}$, to estimate the population variance $\frac{\sum_i^N (X_i - \mu)^2}{N}$ after collecting their data, because they thought it was an unbiased estimator. I think this is not correct, since unbiasedness also depends on the sampling scheme. It is misleading to say that we should use the sample variance with Bessel's correction to estimate the population variance while implicitly assuming that the scheme is Simple Random Sampling With Replacement (SRSWR).

The formulas for SRSWR are simple, so they are often introduced in introductory statistics courses; the problem is that when people survey using a probability sampling method, they very often use Simple Random Sampling Without Replacement (SRSWOR) instead.

If the sampling scheme is SRSWOR, $\frac{\sum_i^n (X_i - \bar{X})^2}{n-1}$ turns out to be an unbiased estimator of $\frac{\sum_i^N (X_i - \mu)^2}{N-1}$.
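This claim is easy to check numerically by exact enumeration: for a small finite population, average the Bessel-corrected sample variance over every equally likely SRSWOR sample of size $n$, which gives its exact expectation. A minimal sketch (the population values and sizes below are arbitrary choices for illustration):

```python
from itertools import combinations
from statistics import variance, mean

# Arbitrary small finite population (any values work)
x = [2.0, 5.0, 7.0, 11.0, 13.0]
N, n = len(x), 3
mu = mean(x)

# Population variance with denominator N-1 (the claimed SRSWOR target)
pop_var_Nminus1 = sum((xi - mu) ** 2 for xi in x) / (N - 1)

# Exact expectation of the Bessel-corrected sample variance:
# average over all C(N, n) equally likely without-replacement samples.
samples = list(combinations(x, n))
expected_s2 = mean(variance(s) for s in samples)  # variance() divides by n-1

print(expected_s2, pop_var_Nminus1)  # the two agree
```

The average over all $\binom{N}{n}$ samples matches the $N-1$ version of the population variance, not the $N$ version.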

I want to know if what I think is correct. Thank you so much.

Best Answer

Yes, that statement is correct.

A convenient and rigorous way to analyze this uses the indicator function of the sample. That is, define the random variable $I_i$ to be $1$ when $x_i$ is in the sample and $0$ otherwise. Clearly the sampling procedure defines the $I_i$ and conversely.

Because the sample size is fixed at $n,$

$$\sum_{i=1}^N I_i = n$$

is constant. We can deduce everything we need from this simple fact.

We can easily compute the moments of these random variables, because they are exchangeable: in particular, the distributions of any $I_i$ and $I_j$ must be the same, $I_i^2 = I_i,$ and the joint distributions of any of the ordered pairs $(I_i,I_j),$ for $i\ne j,$ must all be the same. This justifies writing

$$E[I_i^2] = E[I_i] = p\text{ and } E[I_iI_j] = q$$

for some fixed numbers $p$ and $q$ (assuming $N\ge 2;$ the case $N=1$ is not in the scope of the question). Moreover, the sample size determines both these expectations, since

$$n = E[n] = E\left[\sum_{i=1}^N I_i\right] = \sum_{i=1}^N E[I_i] = Np$$

and

$$n^2 = E[n^2] = E\left[\left(\sum_{i=1}^N I_i\right)^2\right] = \sum_{i=1}^N E[I_i^2] + \sum_{i\ne j} E[I_iI_j] = Np + N(N-1)q$$

imply

$$p = \frac{n}{N}\quad\text{and}\quad q =\frac{n}{N}\frac{n-1}{N-1}.$$
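These two inclusion probabilities can also be verified by brute force: enumerate all $\binom{N}{n}$ equally likely samples and count how often one unit, or a given pair of units, is included. A sketch with arbitrary small $N$ and $n$, using exact rational arithmetic:

```python
from itertools import combinations
from fractions import Fraction

N, n = 6, 3  # arbitrary small sizes for illustration
samples = list(combinations(range(N), n))  # all equally likely SRSWOR samples
M = Fraction(len(samples))

# p = E[I_0]: fraction of samples containing unit 0
p = Fraction(sum(0 in s for s in samples)) / M
# q = E[I_0 I_1]: fraction of samples containing both units 0 and 1
q = Fraction(sum(0 in s and 1 in s for s in samples)) / M

print(p == Fraction(n, N))                           # True
print(q == Fraction(n, N) * Fraction(n - 1, N - 1))  # True
```

By exchangeability, checking units $0$ and $1$ suffices; the same counts hold for any unit or pair.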

The rest is algebra: essentially, rearranging sums and products. Here are the details.

The sample variance is a random variable determined by the sample. It can be expressed in terms of the $I_i$ and population values $x_i$ as

$$\begin{aligned} &\sum_{i=1}^N (x_i I_i )^2 - \frac{1}{n}\left(\sum_{i=1}^N x_iI_i\right)^2 \\ &= \sum_{i=1}^N x_i^2 I_i - \frac{1}{n}\sum_{i=1}^N x_i^2 I_i - \frac{1}{n}\sum_{i\ne j}^N x_i x_jI_iI_j . \end{aligned}$$

all divided by $n-1.$
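Before taking expectations, it is worth confirming that this indicator expression really equals the usual sum of squared deviations within the sample. A quick sketch over all samples of an arbitrary small population:

```python
from itertools import combinations

# Arbitrary small population (any values work)
x = [2.0, 5.0, 7.0, 11.0]
N, n = len(x), 2

def numerator_via_indicators(sample):
    """Evaluate sum_i x_i^2 I_i - (1/n) (sum_i x_i I_i)^2 for one sample."""
    I = [1.0 if i in sample else 0.0 for i in range(N)]
    s1 = sum(x[i] ** 2 * I[i] for i in range(N))
    s2 = sum(x[i] * I[i] for i in range(N))
    return s1 - s2 ** 2 / n

def numerator_direct(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = sum(x[i] for i in sample) / n
    return sum((x[i] - xbar) ** 2 for i in sample)

all_match = all(
    abs(numerator_via_indicators(s) - numerator_direct(s)) < 1e-9
    for s in combinations(range(N), n)
)
print(all_match)  # True: the indicator form reproduces the usual numerator
```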

The expectation of this therefore equals

$$\begin{aligned} \sum_{i=1}^N x_i^2 E[I_i] - \frac{1}{n}\sum_{i=1}^N x_i^2 E[I_i] - \frac{1}{n}\sum_{i\ne j}^N x_i x_jE[I_iI_j]\\ =\sum_{i=1}^N x_i^2 \frac{n}{N} - \frac{1}{n}\sum_{i=1}^N x_i^2 \frac{n}{N} - \frac{1}{n}\sum_{i\ne j}^N x_i x_j\frac{n}{N}\frac{n-1}{N-1}\\ =\frac{n-1}{N}\left[\sum_{i=1}^N x_i^2 - \frac{1}{N-1}\sum_{i\ne j}^N x_i x_j\right]\\ =\frac{n-1}{N-1}\left[\sum_{i=1}^N x_i^2 - \frac{1}{N}\left(\sum_{i=1}^N x_i\right)^2\right] \\ =(n-1) \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2, \end{aligned}$$

where, as in the question, $N\mu$ is the sum of all the $x_i.$

Upon dividing by $n-1,$ we obtain the stated result.