T-Distribution – Why Do the Normal Distribution and the t-Distribution Have Different Standard Errors?

Tags: harmonic mean, hypothesis testing, t-distribution

In the context of t-distribution confidence intervals, when computing the standard error of the difference of means with unequal sample sizes, the harmonic mean of the sample sizes must be used.

Standard error, normal distribution: $\sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} }$

Standard error, t-distribution: $\sqrt{ \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$
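To make the comparison concrete, here is a minimal numeric sketch of the two formulas (the sample values are made up for illustration; numpy is assumed):

```python
import numpy as np

# Hypothetical samples of unequal size (values made up for illustration).
x1 = np.array([4.2, 5.1, 6.3, 5.8, 4.9])
x2 = np.array([3.9, 4.4, 5.0, 4.1, 4.7, 5.2, 4.8, 4.5])
n1, n2 = len(x1), len(x2)
s1, s2 = x1.std(ddof=1), x2.std(ddof=1)  # corrected sample standard deviations

# First formula: unpooled standard error of the difference of means.
se_unpooled = np.sqrt(s1**2 / n1 + s2**2 / n2)

# Second formula: pooled (equal-variance) standard error.
s2_pooled = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se_pooled = np.sqrt(s2_pooled * (1 / n1 + 1 / n2))

print(se_unpooled, se_pooled)
```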

I gather the difference between the two is due to the degrees of freedom, which enter the t-distribution's PDF, but I'm a little lost as to how the second formula was derived.

Edit: I came across a source, http://www.math.iup.edu/~clamb/class/math217/5_3-two-means/, and now I'm a little more confused. It says the latter formula is appropriate when the analyst wants to pool the standard deviation across samples, whereas the former is appropriate when one does NOT want to pool the standard deviation.

Why would one choose to pool, or not to pool, the standard deviation?

Source: https://onlinestatbook.com/2/estimation/difference_means.html

Best Answer

TL;DR: The difference between the two situations is whether you use $\sqrt{\frac{s_a^2}{n_a} + \frac{s_b^2}{n_b}}$ or $\sqrt{\frac{s^2}{n_a} + \frac{s^2}{n_b}}$.

In the second case the estimate of the variance of the populations $a$ and $b$ is coupled based on the assumption that the populations have equal variance.
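In practice this is exactly the choice that software asks you to make. As a sketch (simulated data; scipy assumed), scipy's `ttest_ind` switches between the two situations via its `equal_var` flag:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=10)
b = rng.normal(loc=0.5, scale=1.0, size=15)

# Pooled: Student's t with n_a + n_b - 2 degrees of freedom,
# assuming the two populations share one variance.
print(stats.ttest_ind(a, b, equal_var=True))

# Unpooled: Welch's t, no equal-variance assumption,
# with approximate (Welch-Satterthwaite) degrees of freedom.
print(stats.ttest_ind(a, b, equal_var=False))
```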

The nasty-looking formula in the second case stems from deriving the pooled sample deviation $s$ from the individual sample deviations $s_a$ and $s_b$.


The formulas might become more intuitive when you consider the sum of squared residuals from which the sample standard deviation is derived.

$$ \sum_{i=1}^n {r_i^2} = \sum_{i=1}^n (x_i - \bar{x})^2$$

This $\sum_i r_i^2$ is a sum of $n$ terms but is effectively equivalent to a sum of $n-1$ squared independent normally distributed variables with variance $\sigma^2$. (See for instance Why are the residuals in $\mathbb{R}^{n-p}$?)

The sum of $n-1$ squared independent normally distributed variables with standard deviation $\sigma$ follows a gamma distribution* with shape parameter $(n-1)/2$ and scale parameter $2\sigma^2$, and so has a mean of $(n-1)\sigma^2$. So if we divide by $n-1$ then we have an unbiased estimate of the variance.
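A quick Monte Carlo check of this claim (sample size, $\sigma$, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, reps = 8, 2.0, 100_000

# Many samples of size n; sum of squared residuals for each sample.
x = rng.normal(0.0, sigma, size=(reps, n))
ssr = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# The simulated mean should be close to the gamma mean
# shape * scale = (n - 1)/2 * 2 * sigma^2 = (n - 1) * sigma^2.
print(ssr.mean(), (n - 1) * sigma**2)

# Dividing by n - 1 gives an (approximately) unbiased estimate of sigma^2.
print((ssr / (n - 1)).mean(), sigma**2)
```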

$$s^2 = \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n {r_i^2}$$

And $s$ is the corrected sample estimate of the standard deviation

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n {r_i^2}}$$

Now suppose we have more residual terms because we sampled two populations that we assume to have the same variance (that assumption is what the pooling does). Then we simply sum those residual terms, and we get a gamma-distributed variable that is equivalent to a sum of $(n_a - 1)+(n_b -1)$ squared normally distributed variables.

$$s = \sqrt{\frac{1}{(n_a-1)+(n_b-1)} \left(\sum_{i=1}^{n_a} {r_{a,i}^2} + \sum_{i=1}^{n_b} {r_{b,i}^2}\right)}$$

where $n_a$ and $n_b$ are the sizes of the two samples and $r_{a,i}$ and $r_{b,i}$ the residual terms in the two samples.

If, instead of the sums of squared residuals, you use the corrected sample standard deviations $$\sum_{i=1}^{n_a} {r_{a,i}^2} = s_a^2 (n_a -1)\\ \sum_{i=1}^{n_b} {r_{b,i}^2} = s_b^2 (n_b -1)$$

then you get

$$s = \sqrt{\frac{s_a^2 (n_a -1) +s_b^2 (n_b -1)}{(n_a-1)+(n_b-1)}}$$
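A small sketch (simulated data, arbitrary parameters) confirming that the two routes to the pooled $s$ — via the raw residuals and via the corrected sample deviations — agree:

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(10.0, 3.0, size=7)   # both populations share sigma = 3
b = rng.normal(12.0, 3.0, size=11)
na, nb = len(a), len(b)

# Route 1: pooled s directly from the raw residuals.
ssr = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()
s_from_residuals = np.sqrt(ssr / ((na - 1) + (nb - 1)))

# Route 2: pooled s from the corrected sample standard deviations.
sa, sb = a.std(ddof=1), b.std(ddof=1)
s_from_sds = np.sqrt((sa**2 * (na - 1) + sb**2 * (nb - 1)) / ((na - 1) + (nb - 1)))

print(s_from_residuals, s_from_sds)  # identical up to floating point
```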

The additional factor $\sqrt{\frac{1}{n_a}+\frac{1}{n_b}}$ converts an estimate of the variance/deviation of the population into the variance/deviation of a sample mean, or of the difference between two sample means. The estimate of the variance of the one mean will be $s^2/n_a$ and of the other $s^2/n_b$; the estimate of the variance of the difference is the sum of those two.
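As a sketch of that last step (the pooled deviation and sample sizes here are hypothetical numbers):

```python
import numpy as np

# Hypothetical pooled deviation and sample sizes.
s, na, nb = 2.5, 7, 11

var_mean_a = s**2 / na  # variance of the mean of sample a
var_mean_b = s**2 / nb  # variance of the mean of sample b

# Variance of the difference of means is the sum; compare the two forms.
print(np.sqrt(var_mean_a + var_mean_b), s * np.sqrt(1 / na + 1 / nb))
```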


*Some readers might be more familiar with the $\chi^2$ distribution, which is the special case of the gamma distribution with scale parameter equal to $2$: a $\chi^2$ variable with $k$ degrees of freedom is gamma distributed with shape $k/2$ and scale $2$.
