Hypothesis Testing – Why Use n-1 Instead of n in Pooled Sample Variance?

hypothesis testing, variance

I am currently self-learning hypothesis testing and am looking at the independent samples t-test whose test statistic involves the pooled sample variance (https://libguides.library.kent.edu/spss/independentttest),
$$ S_p^2 = \frac{(n_1 - 1)S^2_1+(n_2-1)S_2^2}{n_1+n_2-2},$$
where $n_1, n_2$ are the sizes of the two samples and $S_1^2, S_2^2$ their respective sample variances. This test assumes that the two population variances are equal, $\sigma_1^2 = \sigma_2^2$.

I understand that the pooled sample variance is computed as a weighted average with weights $w_i = n_i -1$ for $i=1,2$. However, I am unsure why $n_i-1$ is used as a weight instead of $n_i$. I understand that $n-1$ is used instead of $n$ so that the usual sample variance is an unbiased estimator of the variance (Bessel's correction), but I cannot see why it is necessary for the pooled sample variance, since the statistic
$$ \frac{n_1S^2_1+n_2S^2_2}{n_1+n_2} $$
is also an unbiased estimator.

Can anyone explain this to me? Thanks.

Best Answer

For a two-sample t test on samples from populations with the same variance $\sigma^2,$ you have two proposed variance estimates

$$ S_p^2 = \frac{(n_1 - 1)S^2_1+(n_2-1)S_2^2}{n_1+n_2-2},$$

and

$$ S_a^2 = \frac{n_1S^2_1+n_2S^2_2}{n_1+n_2}. $$
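
As a concrete illustration, here is a minimal R sketch that computes both estimates for two small samples and checks which one base R's `t.test(..., var.equal = TRUE)` uses. The data vectors `x` and `y` are made up for this example and do not come from the question.

```r
# Hypothetical data, chosen only for illustration
x <- c(101, 93, 118, 107)
y <- c(112, 125, 99, 130, 108, 121)
n1 <- length(x); n2 <- length(y)
s1 <- var(x); s2 <- var(y)            # Bessel-corrected sample variances

# Pooled estimate: weights n_i - 1
sp2 <- ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
# Alternative estimate: weights n_i
sa2 <- (n1 * s1 + n2 * s2) / (n1 + n2)

# The weights n_i - 1 undo the division inside var(), so the pooled
# estimate is the total within-sample sum of squares over n1 + n2 - 2
ss <- sum((x - mean(x))^2) + sum((y - mean(y))^2)
all.equal(sp2, ss / (n1 + n2 - 2))

# t statistic built by hand from the pooled estimate, versus t.test()
t.hand <- (mean(x) - mean(y)) / sqrt(sp2 * (1/n1 + 1/n2))
t.ref  <- unname(t.test(x, y, var.equal = TRUE)$statistic)
all.equal(t.hand, t.ref)

c(pooled = sp2, averaged = sa2)
```

The hand-built statistic agrees with `t.test` because the equal-variance t test in base R is based on the pooled estimate with $n_1+n_2-2$ degrees of freedom.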

For $S_p^2,$ you have found $S_i^2,\ i=1,2,$ each of which requires computing a sample mean $\bar X_i,\ i=1,2,$ and so uses up one degree of freedom. So, for normal data,

$$ \frac{\nu S_p^2}{\sigma^2} \sim \mathsf{Chisq}(\nu), $$

where $\nu = n_1+n_2 - 2.$

For $S_a^2,$ the distribution theory is not so clear. You say something about $S_a^2$ being unbiased, but that hardly specifies a distribution. Let's use the same degrees of freedom $\nu$ as above for an experiment.
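
To make the contrast explicit, here is a short worked derivation. It assumes independent normal samples with common variance $\sigma^2$; the symbols $Q_1, Q_2$ are introduced here only for convenience.

$$ Q_i = \frac{(n_i-1)S_i^2}{\sigma^2} \sim \mathsf{Chisq}(n_i-1), \quad i=1,2, \quad \text{independently}, $$

$$ \frac{(n_1+n_2-2)\,S_p^2}{\sigma^2} = Q_1 + Q_2 \sim \mathsf{Chisq}(n_1+n_2-2), $$

$$ \frac{(n_1+n_2)\,S_a^2}{\sigma^2} = \frac{n_1}{n_1-1}\,Q_1 + \frac{n_2}{n_2-1}\,Q_2. $$

The last expression has mean $n_1+n_2,$ which confirms that $S_a^2$ is unbiased. However, it is a weighted sum of independent chi-squared variables with unequal weights whenever $n_1 \ne n_2,$ so it is not chi-squared in general. (For $n_1 = n_2$ the two estimates coincide.)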

Simulation: Begin by looking at $m = 100\,000$ samples (rows of M1) of size $n_1 = 2$ from $\mathsf{Norm}(\mu_1 = 100, \sigma_1 = 15)$ and as many samples (rows of M2) of size $n_2=3$ from $\mathsf{Norm}(\mu_2 = 110, \sigma_2 = 15).$
We find the sample variances, the pooled variance estimate, and the averaged variance estimate. Then we look at the corresponding chi-squared random variables.

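set.seed(2022)
n1 = 2;  m = 10^5
M1 = matrix(rnorm(n1*m, 100, 15), nrow=m)   # each row is one sample of size n1
v1 = apply(M1, 1, var)                      # sample variances S_1^2
n2 = 3
M2 = matrix(rnorm(n2*m, 110, 15), nrow=m)   # each row is one sample of size n2
v2 = apply(M2, 1, var)                      # sample variances S_2^2

# pooled estimate S_p^2 (weights n_i - 1) and its scaled version
pool = ((n1-1)*v1 + (n2-1)*v2)/(n1+n2-2)
q.p = (n1+n2-2)*pool/15^2
# averaged estimate S_a^2 (weights n_i) and its scaled version
avg.v = (n1*v1 + n2*v2)/(n1+n2)
q.a = (n1+n2)*avg.v/15^2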

Then we compare the results with the density functions of the corresponding chi-squared distribution. For the pooled estimate $S_p^2$ we get a good match, but for $S_a^2$ the fit is not good.
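
A numerical check can complement the visual comparison. The following sketch re-runs the same kind of simulation in self-contained form, assuming the averaged estimate is computed with weights $n_1, n_2$ as in its definition. It compares each scaled statistic with $\mathsf{Chisq}(\nu)$ through its first two moments and its Kolmogorov-Smirnov distance. The names `sigma`, `nu`, `ks.p` and `ks.a` are introduced here for illustration only.

```r
set.seed(2022)
m <- 10^5; n1 <- 2; n2 <- 3; sigma <- 15

# Row-wise sample variances of m simulated samples from each population
v1 <- apply(matrix(rnorm(n1 * m, 100, sigma), nrow = m), 1, var)
v2 <- apply(matrix(rnorm(n2 * m, 110, sigma), nrow = m), 1, var)

q.p <- ((n1 - 1) * v1 + (n2 - 1) * v2) / sigma^2   # nu * S_p^2 / sigma^2
q.a <- (n1 * v1 + n2 * v2) / sigma^2               # (n1 + n2) * S_a^2 / sigma^2

nu <- n1 + n2 - 2
# A Chisq(nu) variable has mean nu and variance 2 * nu
rbind(pooled   = c(mean = mean(q.p), var = var(q.p)),
      averaged = c(mean = mean(q.a), var = var(q.a)))

# Kolmogorov-Smirnov distance between each sample and Chisq(nu)
ks.p <- unname(ks.test(q.p, "pchisq", df = nu)$statistic)
ks.a <- unname(ks.test(q.a, "pchisq", df = nu)$statistic)
c(pooled = ks.p, averaged = ks.a)
```

A $\mathsf{Chisq}(3)$ variable has mean 3 and variance 6, which the pooled statistic should approximately reproduce. With these sample sizes the averaged statistic has theoretical mean 5 and variance 17, and its Kolmogorov-Smirnov distance should be much larger than that of the pooled statistic.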

[Figure: histograms of q.p ("Pooled") and q.a ("Averaged"), each with a chi-squared density curve overlaid]

R code for graphs:

par(mfrow=c(1,2))
hist(q.p, prob=TRUE, ylim=c(0,.35), col="skyblue2", main="Pooled")
 curve(dchisq(x, n1+n2-2), add=TRUE, lwd=2, col="orange")

hist(q.a, prob=TRUE, ylim=c(0,.35), col="skyblue2", main="Averaged")
 curve(dchisq(x, n1+n2-2), add=TRUE, lwd=2, col="orange")   # same df as in the pooled panel
par(mfrow=c(1,1))