Hypothesis Testing – Explanation and Applications of Pooled Variance

distributions, hypothesis testing, normal distribution, t-distribution, variance

If I am conducting a difference-in-means hypothesis test, when should I use the pooled variance, and why?

Let's say the population variances are unknown for two samples, the sample sizes are small (around 20), and both populations follow a normal distribution, so I would use a t-distribution. In this case, can't we just carry out a difference-of-means hypothesis test by adding the variances of the two sample means, e.g. $s_x^2/n_x + s_y^2/n_y$, and then taking the square root, as shown below?

$t = \frac{(\bar{X}_{x}- \bar{X}_{y})-(\mu_{x}-\mu_{y})}{\sqrt{\frac{s^{2}_{x}}{n_x}+ \frac{s^{2}_{y}}{n_y}}}$
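For concreteness, this "just add the variances of the two sample means" statistic can be computed directly. A minimal sketch in Python (the data and parameters are made up for illustration); it coincides with the statistic of SciPy's Welch t test, which does no pooling:

```python
import numpy as np
from scipy import stats

# Illustrative data: two small samples (n around 20) from normal
# populations with unknown variances -- the setting in the question.
rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=20)
y = rng.normal(loc=4.0, scale=2.0, size=20)

nx, ny = len(x), len(y)
s2x, s2y = x.var(ddof=1), y.var(ddof=1)  # unbiased sample variances

# Add the variances of the two sample means, then take the square root:
t_unpooled = (x.mean() - y.mean()) / np.sqrt(s2x / nx + s2y / ny)

# Same statistic as SciPy's two-sample t test with equal_var=False (Welch):
t_welch = stats.ttest_ind(x, y, equal_var=False).statistic
```

The two values agree, confirming that the formula above is exactly the unpooled (Welch) statistic.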

Since we can add variances for independent random variables, why is it necessary to pool? Similarly, for difference-of-means tests where a Z statistic is calculated (sample sizes are large and the true population variances are known), why are the variances never pooled but simply added?

Can someone please explain what I am missing here?

Best Answer

If you have good reason to believe that the variances of the two populations are equal, then it makes sense to use this information to improve the efficiency of your estimate.

In this case, your test statistic becomes:

$$t=\frac{(\bar{x}_x-\bar{x}_y)-(\mu_x-\mu_y)}{s\sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}$$

So instead of having to estimate two variances, $\sigma_x^2$ and $\sigma_y^2$, you now have to estimate only one, $\sigma^2$.

In principle you could use either of the two sample variance estimates, but that would ignore part of the available information. Surely we can do better by combining the information from the two samples.

One way to combine the variance estimates from the two samples in an unbiased way is the pooled variance estimate:

$$s_{pooled}^2 = \frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n_x+n_y-2}$$

where $s_x^2$ and $s_y^2$ are the unbiased sample variance estimates: $s_x^2 = \frac{1}{n_x-1}\sum_{i=1}^{n_x}(x_i-\bar{x}_x)^2$ (and similarly for $s_y^2$).
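Putting the two formulas together, here is a short Python sketch (sample sizes and parameters are invented) that computes the pooled variance and the pooled t statistic, and checks it against SciPy's equal-variance two-sample t test:

```python
import numpy as np
from scipy import stats

# Illustrative data: two samples assumed to share one population variance.
rng = np.random.default_rng(7)
x = rng.normal(loc=10.0, scale=3.0, size=18)
y = rng.normal(loc=11.0, scale=3.0, size=22)

nx, ny = len(x), len(y)
s2x, s2y = x.var(ddof=1), y.var(ddof=1)  # unbiased sample variances

# Pooled variance: degrees-of-freedom-weighted average of the two estimates.
s2_pooled = ((nx - 1) * s2x + (ny - 1) * s2y) / (nx + ny - 2)
s = np.sqrt(s2_pooled)

# Pooled t statistic (testing mu_x - mu_y = 0):
t_pooled = (x.mean() - y.mean()) / (s * np.sqrt(1 / nx + 1 / ny))

# SciPy's two-sample t test with equal_var=True uses this same statistic:
t_scipy = stats.ttest_ind(x, y, equal_var=True).statistic
```

Note that $s^2_{pooled}$ always lies between $s_x^2$ and $s_y^2$, since it is a weighted average of the two.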


Edited after I understood the second part of your question:

In addition, do not confuse:

  • The pooled variance $s^2_{pooled}$, as above, which is an estimate of $\sigma^2$
  • The variance of the difference of two sample means, with sample sizes $n_x$ and $n_y$ and corresponding variances $\sigma_x^2$ and $\sigma_y^2$, which is: $var(\bar{x}_x-\bar{x}_y)=\frac{\sigma_x^2}{n_x}+\frac{\sigma_y^2}{n_y}$.

Note that the latter, which appears under the square root in the denominator of your t statistic, is the variance of the quantity of interest, the difference between the two sample means. It is not about estimating a variance; rather, it is about standardizing your statistic.
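A quick simulation makes the distinction concrete: for independent samples, the empirical variance of $\bar{x}_x-\bar{x}_y$ matches $\sigma_x^2/n_x+\sigma_y^2/n_y$. A minimal sketch (all parameters here are illustrative):

```python
import numpy as np

# Simulate many replications of two independent sample means and
# compare the variance of their difference with the formula above.
rng = np.random.default_rng(0)
nx, ny = 20, 25
sigma_x, sigma_y = 2.0, 3.0
reps = 100_000

xbar = rng.normal(0.0, sigma_x, size=(reps, nx)).mean(axis=1)
ybar = rng.normal(0.0, sigma_y, size=(reps, ny)).mean(axis=1)

empirical = (xbar - ybar).var()
theoretical = sigma_x**2 / nx + sigma_y**2 / ny  # 0.2 + 0.36 = 0.56
print(empirical, theoretical)  # the two should be close
```

This holds whether or not $\sigma_x^2 = \sigma_y^2$; pooling only changes how that denominator is *estimated* when the variances are assumed equal.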
