Variance – How to Estimate Population Variance from Multiple Samples

chi-distribution, population, sample, variance

Suppose I have $N$ samples each of size $n$, drawn from the same population, where each sample has its own sample variance $s_i^2$.

I understand that for any given sample, a first (unbiased) estimate of the population variance $\sigma^2$ is obtained from the divisor-$n$ sample variance $s_i^2$ as:

$$\hat{\sigma}^2 = s_i^2 \displaystyle\frac{n}{n-1}$$
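For concreteness, here is that correction as a minimal Python sketch (this assumes $s_i^2$ is the divisor-$n$ variance, which is what the $n/(n-1)$ factor corrects; the sample itself is simulated just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=30)  # one sample of size n = 30
n = x.size

s2_biased = np.var(x)               # divisor n (ddof=0 is NumPy's default)
sigma2_hat = s2_biased * n / (n - 1)

# Matches NumPy's built-in unbiased estimator (divisor n - 1):
assert np.isclose(sigma2_hat, np.var(x, ddof=1))
```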

But what if you have multiple samples? What is the best estimate of $\sigma^2$ given multiple $s^2$?

Best Answer

A comment clarified that the best estimator of the variance is intended to be unbiased. A standard way to find such an estimator is to restrict one's attention to linear combinations of the estimators (because almost anything else would be difficult to analyze).

Let there be $k$ independent samples indexed by $i=1,2,\ldots, k$, each with its own unbiased variance estimator $\hat\sigma_i^2$. Let the unknown weights of the linear combination be $w_i$, so that the combined estimator will be

$$\hat\sigma^2 = \sum_{i=1}^k w_i \hat \sigma_i^2.$$

Because this is supposed to be unbiased for any population, by definition the population variance will equal its expected value:

$$\sigma^2 = \mathbb{E}(\hat\sigma^2) = \sum_{i=1}^k w_i \mathbb{E}(\hat\sigma_i^2) = \sum_{i=1}^k w_i \sigma^2 = \left(\sum_{i=1}^k w_i\right)\sigma^2.$$

Because this identity must hold even for populations with $\sigma^2 \ne 0$, dividing both sides by $\sigma^2$ shows that the weights sum to unity:

$$1 = \sum_{i=1}^k w_i.$$
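A small simulation illustrates why the constraint matters (the weights, population, and sample sizes below are arbitrary choices for illustration, not part of the argument): weights summing to one give an empirically unbiased combination, while other weights inflate or deflate the expectation by the weight sum.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                          # true population variance
k, n = 3, 20                          # three samples of size 20 each
w_good = np.array([0.5, 0.3, 0.2])    # sums to 1
w_bad  = np.array([0.5, 0.5, 0.5])    # sums to 1.5

trials = 100_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, k, n))
s2 = samples.var(axis=2, ddof=1)      # unbiased per-sample estimators

print((s2 * w_good).sum(axis=1).mean())  # ~ 4.0: unbiased
print((s2 * w_bad).sum(axis=1).mean())   # ~ 6.0: biased by the weight sum
```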

Let the sample size for estimator $\hat\sigma_i^2$ be $n_i$. (In the question all the $n_i$ are equal to $n$.) Because each estimator $\hat\sigma_i^2$ has $n_i-1$ degrees of freedom, its variance will be approximately $f/(n_i-1)$, where $f$ is an (unknown) property of the population that depends on its first four moments. In fact, for a Normal population this is not an approximation at all: the variances of the estimators are exactly proportional to $1/(n_i-1)$. Therefore we may approximate the variance of the combined estimator as

$$\operatorname{Var}(\hat\sigma^2) = \operatorname{Var}\left(\sum_{i=1}^k w_i \hat\sigma_i^2\right) = \sum_{i=1}^k w_i^2 \operatorname{Var}(\hat\sigma_i^2) \approx \sum_{i=1}^k w_i^2 \frac{f}{n_i-1}.$$ The second equality is due to the independence of the $k$ samples.
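For a Normal population the constant is $f = 2\sigma^4$, so $\operatorname{Var}(\hat\sigma_i^2) = 2\sigma^4/(n_i-1)$ exactly. A quick Monte Carlo check of that scaling (an illustration only, with arbitrary parameter choices):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0
for n in (5, 10, 40):
    # 200,000 replications of the unbiased variance estimator at sample size n
    s2 = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n)).var(axis=1, ddof=1)
    print(n, s2.var(), 2 * sigma2**2 / (n - 1))  # empirical vs 2*sigma^4/(n-1)
```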

Subject to the sum-to-unity constraint, this variance is minimized when the $w_i$ are proportional to $n_i-1$. Therefore an (approximate) minimum variance unbiased linear estimator of the population variance is

$$\hat\sigma^2 = \frac{1}{n_1+\cdots+n_k - k}\sum_{i=1}^k (n_i-1)\hat\sigma_i^2.$$
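To see why those weights are optimal, apply a Lagrange multiplier $\lambda$ to the constrained minimization:

$$\frac{\partial}{\partial w_i}\left[\sum_{j=1}^k \frac{w_j^2 f}{n_j-1} - \lambda\left(\sum_{j=1}^k w_j - 1\right)\right] = \frac{2 f w_i}{n_i-1} - \lambda = 0 \quad\Longrightarrow\quad w_i = \frac{\lambda (n_i-1)}{2f},$$

so each $w_i$ is proportional to $n_i-1$, and the sum-to-unity constraint then fixes $w_i = (n_i-1)/(n_1+\cdots+n_k-k)$, which is exactly the formula above.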

When all the $n_i$ are equal to a common value $n$, this reduces to the arithmetic mean of the individual variance estimators. That is intuitively right: when the samples are exchangeable, no estimator deserves more weight than any other.
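A minimal sketch of this pooled estimator in Python (the helper name `pooled_variance` is mine, and the data are simulated for illustration):

```python
import numpy as np

def pooled_variance(samples):
    """Degrees-of-freedom-weighted combination of per-sample unbiased
    variance estimators: sum_i (n_i - 1) * s_i^2 / (n_1 + ... + n_k - k)."""
    s2 = np.array([np.var(s, ddof=1) for s in samples])
    dof = np.array([len(s) - 1 for s in samples])     # dof.sum() = sum(n_i) - k
    return (dof * s2).sum() / dof.sum()

rng = np.random.default_rng(3)
samples = [rng.normal(0.0, 2.0, size=n) for n in (10, 25, 40)]
print(pooled_variance(samples))   # estimate of the true sigma^2 = 4.0

# With equal sample sizes the estimator is just the plain mean of the s_i^2:
equal = [rng.normal(0.0, 2.0, size=15) for _ in range(4)]
assert np.isclose(pooled_variance(equal),
                  np.mean([np.var(s, ddof=1) for s in equal]))
```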
