Hypothesis Testing – Explanation and Applications of Pooled Variance

distributions, hypothesis testing, normal distribution, t-distribution, variance

If I am conducting a difference-in-means hypothesis test, when should I use the pooled variance, and why?

Let's say the population variances are unknown for two samples, the sample sizes are small (around 20), and both populations follow a normal distribution, so I would use a t-distribution. In this case, can't we just carry out a difference-of-means hypothesis test by adding the variances of the two sample means, e.g. $s_x^2/n_x + s_y^2/n_y$, and then taking the square root, as shown below?

$t = \frac{(\bar{X}_{x}- \bar{X}_{y})-(\mu_{x}-\mu_{y})}{\sqrt{\frac{s^{2}_{x}}{n_x}+ \frac{s^{2}_{y}}{n_y}}}$
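For concreteness, this "just add the variances of the two sample means" statistic can be computed directly. A minimal sketch in Python (the data and parameters are made up for illustration); it coincides with the statistic of SciPy's Welch t test, which does no pooling:

```python
import numpy as np
from scipy import stats

# Illustrative data: two small samples (n around 20) from normal
# populations with unknown variances -- the setting in the question.
rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=20)
y = rng.normal(loc=4.0, scale=2.0, size=20)

nx, ny = len(x), len(y)
s2x, s2y = x.var(ddof=1), y.var(ddof=1)  # unbiased sample variances

# Add the variances of the two sample means, then take the square root:
t_unpooled = (x.mean() - y.mean()) / np.sqrt(s2x / nx + s2y / ny)

# Same statistic as SciPy's two-sample t test with equal_var=False (Welch):
t_welch = stats.ttest_ind(x, y, equal_var=False).statistic
```

The two values agree, confirming that the formula above is exactly the unpooled (Welch) statistic.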

Since we can add variances for independent random variables, why is it necessary to pool? Similarly, for difference-of-means tests where a Z statistic is calculated (sample sizes are large and the true population variances are known), why are the variances never pooled but simply added?

Can someone please explain what I am missing here?

Best Answer

If you have good reason to believe that the variances of the two populations are equal, then it makes sense to use this information to improve the efficiency of your estimate.

In this case, your test statistic becomes:

$$t=\frac{(\bar{x}_x-\bar{x}_y)-(\mu_x-\mu_y)}{s\sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}$$

So instead of having to estimate two variances, $\sigma_x^2$ and $\sigma_y^2$, you now have to estimate only one, $\sigma^2$.

In principle you could use either of the two sample variance estimates, but that would ignore part of the available information. Surely we can do better by combining the information from the two samples.

One way to combine the variance estimates from the two samples in an unbiased way is the pooled variance estimate:

$$s_{pooled}^2 = \frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n_x+n_y-2}$$

where $s_x^2$ and $s_y^2$ are the unbiased sample variance estimates: $s_x^2 = \frac{1}{n_x-1}\sum_{i=1}^{n_x}(x_i-\bar{x}_x)^2$ (and similarly for $s_y^2$).
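Putting the two formulas together, here is a short Python sketch (sample sizes and parameters are invented) that computes the pooled variance and the pooled t statistic, and checks it against SciPy's equal-variance two-sample t test:

```python
import numpy as np
from scipy import stats

# Illustrative data: two samples assumed to share one population variance.
rng = np.random.default_rng(7)
x = rng.normal(loc=10.0, scale=3.0, size=18)
y = rng.normal(loc=11.0, scale=3.0, size=22)

nx, ny = len(x), len(y)
s2x, s2y = x.var(ddof=1), y.var(ddof=1)  # unbiased sample variances

# Pooled variance: degrees-of-freedom-weighted average of the two estimates.
s2_pooled = ((nx - 1) * s2x + (ny - 1) * s2y) / (nx + ny - 2)
s = np.sqrt(s2_pooled)

# Pooled t statistic (testing mu_x - mu_y = 0):
t_pooled = (x.mean() - y.mean()) / (s * np.sqrt(1 / nx + 1 / ny))

# SciPy's two-sample t test with equal_var=True uses this same statistic:
t_scipy = stats.ttest_ind(x, y, equal_var=True).statistic
```

Note that $s^2_{pooled}$ always lies between $s_x^2$ and $s_y^2$, since it is a weighted average of the two.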


Edited after I understood the second part of your question:

In addition, do not confuse:

  • The pooled variance $s^2_{pooled}$, as above, which is an estimate of $\sigma^2$
  • The variance of the difference of two sample means, with sample sizes $n_x$ and $n_y$ and corresponding variances $\sigma_x^2$ and $\sigma_y^2$, which is: $var(\bar{x}_x-\bar{x}_y)=\frac{\sigma_x^2}{n_x}+\frac{\sigma_y^2}{n_y}$.

Note that the latter, which appears under the square root in the denominator of your t statistic, is the variance of the quantity of interest, the difference between the two sample means. It is not about estimating a variance; rather, it is about standardizing your statistic.
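A quick simulation makes the distinction concrete: for independent samples, the empirical variance of $\bar{x}_x-\bar{x}_y$ matches $\sigma_x^2/n_x+\sigma_y^2/n_y$. A minimal sketch (all parameters here are illustrative):

```python
import numpy as np

# Simulate many replications of two independent sample means and
# compare the variance of their difference with the formula above.
rng = np.random.default_rng(0)
nx, ny = 20, 25
sigma_x, sigma_y = 2.0, 3.0
reps = 100_000

xbar = rng.normal(0.0, sigma_x, size=(reps, nx)).mean(axis=1)
ybar = rng.normal(0.0, sigma_y, size=(reps, ny)).mean(axis=1)

empirical = (xbar - ybar).var()
theoretical = sigma_x**2 / nx + sigma_y**2 / ny  # 0.2 + 0.36 = 0.56
print(empirical, theoretical)  # the two should be close
```

This holds whether or not $\sigma_x^2 = \sigma_y^2$; pooling only changes how that denominator is *estimated* when the variances are assumed equal.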
