Solved – How to intuitively understand formula for estimate of pooled variance when testing differences between group means

degrees of freedom, mean, variance

Suppose I want to compare the means of samples selected from two populations (the treatment and control). Assume both groups have normally distributed observations. Then the statistic $$Z = \frac{(\bar{X}_{t}- \bar{X}_{c})-(\mu_{t}-\mu_{c})}{\sqrt{\left(\frac{\sigma^{2}_{t}}{n_t}+ \frac{\sigma^{2}_{c}}{n_c} \right)}}$$ follows a standard normal distribution.

Suppose that $\sigma_{t}^{2}$ and $\sigma_{c}^{2}$ are unknown but can be assumed equal to a common value $\sigma^2$. Why is the pooled estimate $S_{p}^{2}$ of $\sigma^2$ equal to $$S_{p}^{2} = \frac{S_{t}^{2}(n_{t}-1)+ S_{c}^{2}(n_{c}-1)}{n_t+n_c-2}$$ where $S_{t}^2$ and $S_{c}^2$ are the sample variances of the treatment and control groups? I know this has something to do with degrees of freedom, but I never could really "grok" their definition.

In short, how do we get the pooled estimate and what are degrees of freedom intuitively?

Best Answer

There are really 2 questions here, one about pooling and one about degrees of freedom.

Let's look at degrees of freedom first. To get the concept, suppose we know that $x+y+z=10$. Then $x$ can be anything we want, and $y$ can be anything we want, but once we set those 2 there is only one value that $z$ can take, so we have 2 degrees of freedom.

Now consider calculating $S^2$. If we knew the population mean, we could subtract it from each $x_i$, square, sum, and divide by $n$ to get the average squared difference. But we generally don't know the population mean, so we subtract the sample mean as an estimate of it. The sample mean is computed from the same data we are using to find $S^2$, and it is exactly the value that gives the lowest possible sum of squares for those data, so the sum of squares will tend to be too small. If we divide by $n-1$ instead of $n$, the estimate is unbiased, because we have taken into account that we already used the data to compute one piece of information (the mean is just the sum divided by a constant).

In regression models the degrees of freedom are equal to $n$ minus the number of parameters we estimate. Each time you estimate a parameter (mean, intercept, slope) you are spending 1 degree of freedom.
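To see the argument above in numbers, here is a small simulation sketch. It is only an illustration under assumed settings: the sample size, number of trials, true mean, true standard deviation, and the function names are all made up for this example and are not part of the original answer. It draws many small normal samples, checks that the sum of squares about the sample mean never exceeds the sum of squares about the true mean, and averages the variance estimates obtained by dividing by $n$ and by $n-1$.

```python
import random


def sum_of_squares(values, center):
    """Sum of squared deviations of `values` from `center`."""
    return sum((v - center) ** 2 for v in values)


def compare_divisors(n=5, trials=20000, mu=0.0, sigma=2.0, seed=0):
    """Average the n-divisor and (n-1)-divisor variance estimates over many samples.

    All defaults are illustrative assumptions, not values from the discussion.
    """
    rng = random.Random(seed)
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    sample_mean_never_worse = True

    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n

        ss_about_sample_mean = sum_of_squares(sample, sample_mean)
        ss_about_true_mean = sum_of_squares(sample, mu)

        # The sample mean minimizes the sum of squares for its own data;
        # the tiny tolerance only absorbs floating-point rounding.
        if ss_about_sample_mean > ss_about_true_mean + 1e-9:
            sample_mean_never_worse = False

        total_div_n += ss_about_sample_mean / n
        total_div_n_minus_1 += ss_about_sample_mean / (n - 1)

    return (
        total_div_n / trials,
        total_div_n_minus_1 / trials,
        sample_mean_never_worse,
    )


avg_div_n, avg_div_n_minus_1, never_worse = compare_divisors()
print("average estimate dividing by n:    ", avg_div_n)
print("average estimate dividing by n - 1:", avg_div_n_minus_1)
print("sample-mean sum of squares never larger:", never_worse)
```

With these assumed defaults the true variance is $\sigma^2 = 4$, so the $n-1$ average should land close to 4, while the $n$ average should land close to $\frac{n-1}{n}\sigma^2 = 3.2$.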

For the pooled variance formula, $S^2_c$ and $S^2_t$ have already been divided by $n_c-1$ and $n_t-1$ respectively, so multiplying by those quantities just recovers the 2 sums of squares. We then add the 2 sums of squares and divide by the total degrees of freedom, $n_t+n_c-2$ (we subtract 2 because we estimated 2 sample means to get the sums of squares). The pooled variance is just a weighted average of the 2 variances, with each variance weighted by its own degrees of freedom.
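The same calculation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pooled_variance` and the two small data sets are invented for the example. It computes the pooled variance from the two sample variances, then recomputes it directly from the sums of squares about each group's own mean and as a degrees-of-freedom weighted average, so the three routes can be compared.

```python
from statistics import mean, variance


def pooled_variance(treatment, control):
    """Pooled variance estimate, assuming both groups share one true variance."""
    n_t, n_c = len(treatment), len(control)
    # statistics.variance divides by n - 1, so multiplying by n - 1
    # recovers each group's sum of squares about its own sample mean.
    ss_t = (n_t - 1) * variance(treatment)
    ss_c = (n_c - 1) * variance(control)
    # Two sample means were estimated, so 2 degrees of freedom are spent.
    return (ss_t + ss_c) / (n_t + n_c - 2)


# Made-up observations, purely for illustration.
treatment = [5.1, 4.9, 6.2, 5.8, 5.5]
control = [4.2, 4.8, 5.0, 4.4]

pooled = pooled_variance(treatment, control)

# Route 2: add the raw sums of squares, then divide by total degrees of freedom.
ss_direct = sum((x - mean(treatment)) ** 2 for x in treatment) + sum(
    (x - mean(control)) ** 2 for x in control
)
pooled_direct = ss_direct / (len(treatment) + len(control) - 2)

# Route 3: weighted average of the two variances, weights = degrees of freedom.
df_t, df_c = len(treatment) - 1, len(control) - 1
pooled_weighted = (df_t * variance(treatment) + df_c * variance(control)) / (df_t + df_c)

print("pooled variance:          ", pooled)
print("from raw sums of squares: ", pooled_direct)
print("as df-weighted average:   ", pooled_weighted)
```

Because it is a weighted average, the pooled value always lies between the two group variances, and the larger group pulls it closer to its own variance.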