I'm not sure I'm calculating the unbiased pooled estimator for the variance correctly.
Assuming 2 samples where $\sigma_1 = \sigma_2 = \sigma$ is unknown, these are my definitions:
Sample variance: $S^2 = {1\over{n}} \sum{(X_i - \bar{X})^2}$
Unbiased estimator: $\hat{S^2} = {n\over{n-1}}S^2 = {1\over{n-1}} \sum{(X_i - \bar{X})^2}$
Unbiased pooled variance: ${{(n_1 - 1)\hat{S_1^2} + (n_2 - 1)\hat{S_2^2}}\over{(n_1 - 1) + (n_2 - 1)}} = {{n_1S_1^2 + n_2S_2^2}\over{n_1 + n_2 - 2}}$
The last equation, which should give the unbiased pooled estimate, reduces to:
$\frac{\sum{(X_{1i} - \bar{X}_1)^2} + \sum{(X_{2i} - \bar{X}_2)^2}}{n_1 + n_2 - 2}$
Is that correct? Should I expect that the unbiased pooled estimate's variance will be lower than the estimated variance of each individual data set ($\underline{X_1}$ or $\underline{X_2}$)?
Best Answer
First, your notation for the sample variance seems to be muddled. The sample variance is ordinarily defined as $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2,$ which makes it an unbiased estimator of the population variance $\sigma^2.$
Perhaps the most common context for 'unbiased pooled estimator' of variance is for the 'pooled t test': Suppose you have two random samples $X_i$ of size $n$ and $Y_i$ of size $m$ from populations with the same variance $\sigma^2.$ Then the pooled estimator of $\sigma^2$ is
$$S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{m+n-2}.$$
This estimator is unbiased.
Because one says the samples have respective 'degrees of freedom' $n-1$ and $m-1,$ one sometimes says that $S_p^2$ is a 'degrees-of-freedom' weighted average of the two sample variances. If $n = m,$ then $S_p^2 = 0.5S_X^2 + 0.5S_Y^2.$
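As a quick numeric sketch of the degrees-of-freedom weighted average (the sample sizes, means, and seed below are arbitrary choices for illustration), using NumPy's `ddof=1` sample variances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two samples from populations sharing the same variance sigma^2 = 100
x = rng.normal(50, 10, size=8)   # n = 8
y = rng.normal(60, 10, size=5)   # m = 5

n, m = len(x), len(y)
s2_x = x.var(ddof=1)             # unbiased sample variance (n-1 denominator)
s2_y = y.var(ddof=1)             # unbiased sample variance (m-1 denominator)

# Degrees-of-freedom weighted pooled estimator of sigma^2
s2_p = ((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2)
```

Because $S_p^2$ is a convex combination of the two sample variances, it always lies between $S_X^2$ and $S_Y^2.$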
Note: Some authors do define the sample variance as $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2,$ but then the sample variance is not an unbiased estimator of $\sigma^2,$ even though it might have other properties desirable for the author's task at hand. However, most agree that the notation $S^2$ is reserved for the version with $n-1$ in the denominator, unless a specific warning is given otherwise.
Example: One common measure of the 'goodness' of an estimator is that it have a small 'root mean squared error'. If $T$ is an estimator of $\tau,$ then $\text{MSE}_T(\tau) = E[(T-\tau)^2]$ and RMSE is its square root.
The simulation below illustrates for normal data with $n = 5$ and $\sigma^2 = 10^2 = 100,$ that the version of the sample variance with $n$ in the denominator has smaller RMSE than the version with $n-1$ in the denominator. (A formal proof for $n > 1$ is not difficult.)
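The original simulation code is not included in this excerpt; the following NumPy sketch reproduces the comparison under the stated settings ($n = 5,$ $\sigma^2 = 100,$ normal data), with the number of replications chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(2024)
n, sigma2, reps = 5, 100.0, 100_000

# reps independent samples of size n from Normal(0, sd = 10)
samples = rng.normal(0.0, 10.0, size=(reps, n))

s2_biased   = samples.var(axis=1, ddof=0)   # n   in the denominator
s2_unbiased = samples.var(axis=1, ddof=1)   # n-1 in the denominator

def rmse(est):
    """Root mean squared error of the estimates against sigma^2."""
    return np.sqrt(np.mean((est - sigma2) ** 2))

print(rmse(s2_biased), rmse(s2_unbiased))
```

For normal data one can also verify this analytically: with $n = 5$ and $\sigma^2 = 100,$ the $n-1$ version has $\text{MSE} = 2\sigma^4/(n-1) = 5000,$ while the $n$ version has $\text{MSE} = \sigma^4(2n-1)/n^2 = 3600,$ so the RMSEs are about $70.7$ and $60,$ respectively.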