[Math] Unbiased pooled estimator of variance

estimation, estimation-theory, hypothesis-testing, statistics

I'm not sure I'm calculating the unbiased pooled estimator for the variance correctly.

Assuming two samples where $\sigma_1 = \sigma_2 = \sigma$ is unknown, these are my definitions:

Sample variance: $S^2 = \frac{1}{n} \sum (X_i - \bar{X})^2$

Unbiased estimator: $\hat{S^2} = \frac{n}{n-1}S^2 = \frac{1}{n-1} \sum (X_i - \bar{X})^2$

Unbiased pooled variance: $\frac{(n_1 - 1)\hat{S_1^2} + (n_2 - 1)\hat{S_2^2}}{(n_1 - 1) + (n_2 - 1)} = \frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 2}$

The last equation, which should give the unbiased pooled estimate, reduces to:

$$\frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}$$

Is that correct? Should I expect the pooled estimate's variance to be lower than the estimated variance of each individual data set ($\underline{X_1}$ or $\underline{X_2}$)?
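
As a quick numeric sanity check of that reduction (a minimal R sketch with arbitrary sample sizes; note that R's var() already uses the $n-1$ denominator, so $(n_i-1)\,\mathrm{var}(x_i)$ is the sum of squared deviations about each sample mean):

set.seed(1);  x1 = rnorm(8);  x2 = rnorm(12)
n1 = length(x1);  n2 = length(x2)
pooled  = ((n1-1)*var(x1) + (n2-1)*var(x2)) / (n1 + n2 - 2)   # d.f.-weighted form
reduced = (sum((x1-mean(x1))^2) + sum((x2-mean(x2))^2)) / (n1 + n2 - 2)
all.equal(pooled, reduced)   # TRUE: the two forms agree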

Best Answer

First, your notation for the sample variance seems to be muddled. The sample variance is ordinarily defined as $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2,$ which makes it an unbiased estimator of the population variance $\sigma^2.$

Perhaps the most common context for 'unbiased pooled estimator' of variance is for the 'pooled t test': Suppose you have two random samples $X_i$ of size $n$ and $Y_i$ of size $m$ from populations with the same variance $\sigma^2.$ Then the pooled estimator of $\sigma^2$ is

$$S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{m+n-2}.$$

This estimator is unbiased.
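
To see why, take expectations term by term: each sample variance is unbiased, so $E[S_X^2] = E[S_Y^2] = \sigma^2$ and

$$E[S_p^2] = \frac{(n-1)E[S_X^2] + (m-1)E[S_Y^2]}{m+n-2} = \frac{(n-1)\sigma^2 + (m-1)\sigma^2}{m+n-2} = \sigma^2.$$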

Because one says the samples have respective 'degrees of freedom' $n-1$ and $m-1,$ one sometimes says that $S_p^2$ is a 'degrees-of-freedom' weighted average of the two sample variances. If $n = m,$ then $S_p^2 = 0.5S_X^2 + 0.5S_Y^2.$
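
For instance, a toy check in R with hypothetical data (again, var() uses the $n-1$ denominator):

x = rnorm(7);  y = rnorm(7)                  # equal sizes, n = m = 7
Sp2 = (6*var(x) + 6*var(y)) / (7 + 7 - 2)    # pooled, d.f.-weighted
all.equal(Sp2, 0.5*var(x) + 0.5*var(y))      # TRUE: simple average when n = m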


Note: Some authors do define the sample variance as $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2,$ but then the sample variance is not an unbiased estimator of $\sigma^2,$ even though it might have other properties desirable for the author's task at hand. However, most agree that the notation $S^2$ is reserved for the version with $n-1$ in the denominator, unless a specific warning is given otherwise.
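
The two versions differ only by the factor $(n-1)/n,$ as a quick illustrative R snippet shows:

x = rnorm(10);  n = length(x)
v.unb  = var(x)                        # denominator n-1 (R's default)
v.bias = sum((x - mean(x))^2) / n      # denominator n
all.equal(v.bias, (n-1)/n * v.unb)     # TRUE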

Example: One common measure of the 'goodness' of an estimator is that it has a small 'root mean squared error' (RMSE). If $T$ is an estimator of $\tau,$ then $\text{MSE}_T(\tau) = E[(T-\tau)^2],$ and the RMSE is its square root.

The simulation below illustrates, for normal data with $n = 5$ and $\sigma^2 = 10^2 = 100,$ that the version of the sample variance with $n$ in the denominator has smaller RMSE than the version with $n-1$ in the denominator. (A formal proof for $n > 1$ is not difficult.)

set.seed(1888);  m = 10^6;  n = 5;  sigma = 10;  sg.sq = 100
v.a = replicate(m, var(rnorm(n, 100, sigma)))  # denom n-1
v.b = (n-1)*v.a/n                              # denom n
mean(v.a); RMS.a = sqrt(mean((v.a-sg.sq)^2)); RMS.a
[1] 100.0564  # aprx 100: unbiased
[1] 70.81563  # larger RMSE
mean(v.b); RMS.b = sqrt(mean((v.b-sg.sq)^2)); RMS.b
[1] 80.0451   # biased   
[1] 60.06415  # smaller RMSE