I am currently self-learning hypothesis testing and am looking at the independent samples t-test whose test statistic involves the pooled sample variance (https://libguides.library.kent.edu/spss/independentttest),
$$ S_p^2 = \frac{(n_1 - 1)S^2_1+(n_2-1)S_2^2}{n_1+n_2-2},$$
where $n_1, n_2$ are the sample sizes of the two samples and $S_1^2, S_2^2$ their respective sample variances. This test assumes that the two populations have equal variances, $\sigma_1^2 = \sigma_2^2$.
I understand that the pooled sample variance is computed as a weighted average with weights $w_i = n_i - 1$ for $i=1,2$. However, I am unsure why $n_i-1$ is used as a weight instead of $n_i$. I understand that $n-1$ is used instead of $n$ so that the usual sample variance is an unbiased estimator of the variance (Bessel's correction), but I cannot see why it is necessary for the pooled sample variance, since the statistic
$$ \frac{n_1S^2_1+n_2S^2_2}{n_1+n_2} $$
is also an unbiased estimator.
Can anyone explain this to me? Thanks.
Best Answer
For a two-sample t test on samples from populations with the same variance $\sigma^2,$ you have two proposed variance estimates
$$ S_p^2 = \frac{(n_1 - 1)S^2_1+(n_2-1)S_2^2}{n_1+n_2-2},$$
and
$$ S_a^2 = \frac{n_1S^2_1+n_2S^2_2}{n_1+n_2}. $$
For $S_p^2,$ you have found $S_i^2,\ i=1,2,$ each of which requires computing a sample mean $\bar X_i,\ i=1,2,$ and so costs one degree of freedom. So,
$$ \frac{\nu S_p^2}{\sigma^2} \sim \mathsf{Chisq}(\nu),$$
where $\nu = n_1+n_2 - 2.$
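As a sketch of where this comes from, assuming both samples are normal and independent of each other (the usual assumptions for this test), each scaled sample variance is chi-squared:
$$ \frac{(n_i-1)S_i^2}{\sigma^2} \sim \mathsf{Chisq}(n_i-1), \quad i=1,2. $$
Independent chi-squared random variables add, and their degrees of freedom add with them, so
$$ \frac{\nu S_p^2}{\sigma^2} = \frac{(n_1-1)S_1^2}{\sigma^2} + \frac{(n_2-1)S_2^2}{\sigma^2} \sim \mathsf{Chisq}(n_1+n_2-2). $$
The weights $n_i - 1$ are exactly what make the numerator of $S_p^2$ a plain sum of these two chi-squared pieces.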
For $S_a^2,$ the distribution theory is not so clear. You say that $S_a^2$ is unbiased, but that hardly specifies a distribution. Let's use the same degrees of freedom $\nu$ as above for an experiment.
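To make the difficulty concrete, here is a sketch under the same normality assumption, writing $Q_i = (n_i-1)S_i^2/\sigma^2 \sim \mathsf{Chisq}(n_i-1)$ (the symbol $Q_i$ is introduced here only for illustration):
$$ \frac{\nu S_a^2}{\sigma^2} = \frac{\nu}{n_1+n_2}\left(\frac{n_1}{n_1-1}\,Q_1 + \frac{n_2}{n_2-1}\,Q_2\right). $$
For instance, sample sizes $n_1 = 2$ and $n_2 = 3$ give $\nu = 3$ and the combination $1.2\,Q_1 + 0.9\,Q_2.$ Its mean is $1.2(1) + 0.9(2) = 3 = \nu,$ consistent with unbiasedness, but its variance is $1.44(2) + 0.81(4) = 6.12,$ not the $2\nu = 6$ of $\mathsf{Chisq}(3).$ A weighted sum of chi-squared variables with unequal weights is not itself chi-squared, so the scaled statistic does not follow $\mathsf{Chisq}(\nu)$ exactly.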
Simulation: Begin by looking at $m = 10\,000$ samples `x1` of size $n_1 = 2$ from $\mathsf{Norm}(\mu_1 = 100, \sigma_1 = 15)$ and `x2` of size $n_2=3$ from $\mathsf{Norm}(\mu_2 = 110, \sigma_2 = 15).$ We find the sample variances, the pooled variance estimate, and the average variance estimate. Then we look at the corresponding scaled random variables $\nu S_p^2/\sigma^2$ and $\nu S_a^2/\sigma^2.$
We then compare the results with the density function of the chi-squared distribution $\mathsf{Chisq}(\nu).$ For the pooled estimate $S_p^2$ we get a good match, but for $S_a^2$ the fit is not good.
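The experiment described above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the code used for the graphs: the function name `simulate`, its parameter names, and the seed are hypothetical choices made here. It draws the two samples, forms both variance estimates, and returns the scaled values $\nu S_p^2/\sigma^2$ and $\nu S_a^2/\sigma^2.$

```python
import random
import statistics


def simulate(m=10_000, n1=2, n2=3, mu1=100.0, mu2=110.0, sigma=15.0, seed=2024):
    """Return m simulated values of nu*Sp^2/sigma^2 and nu*Sa^2/sigma^2.

    The seed is an arbitrary choice for reproducibility (an assumption,
    not taken from the original answer).
    """
    rng = random.Random(seed)
    nu = n1 + n2 - 2  # degrees of freedom of the pooled estimate
    q_pooled, q_avg = [], []
    for _ in range(m):
        x1 = [rng.gauss(mu1, sigma) for _ in range(n1)]
        x2 = [rng.gauss(mu2, sigma) for _ in range(n2)]
        v1 = statistics.variance(x1)  # sample variance, divisor n1 - 1
        v2 = statistics.variance(x2)  # sample variance, divisor n2 - 1
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / nu  # pooled estimate
        sa2 = (n1 * v1 + n2 * v2) / (n1 + n2)       # "average" estimate
        q_pooled.append(nu * sp2 / sigma**2)
        q_avg.append(nu * sa2 / sigma**2)
    return q_pooled, q_avg


q_pooled, q_avg = simulate()
for name, q in (("pooled", q_pooled), ("average", q_avg)):
    # Chisq(nu) has mean nu and variance 2*nu; compare the summaries to those.
    print(name, round(statistics.mean(q), 3), round(statistics.variance(q), 3))
```

The printed mean and variance are only rough summaries. Plotting a histogram of each list against the $\mathsf{Chisq}(\nu)$ density shows the shape of the fit more directly.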
R code for graphs: