For a two-sample $t$ test on samples from populations with
the same variance $\sigma^2,$ you have two proposed
variance estimates
$$ S_p^2 = \frac{(n_1 - 1)S^2_1+(n_2-1)S_2^2}{n_1+n_2-2},$$
and
$$ S_a^2 = \frac{n_1 S_1^2 + n_2 S_2^2}{n_1+n_2}. $$
For $S_p^2,$ you have found $S_i^2,\ i = 1,2,$ each of which requires computing a sample mean $\bar X_i,\ i = 1,2.$ So
$$ \frac{\nu S_p^2}{\sigma^2} \sim \mathsf{Chisq}(\nu), $$
where $\nu = n_1+n_2 - 2.$
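This follows from standard distribution theory (a step we spell out here for completeness): each $(n_i - 1)S_i^2/\sigma^2 \sim \mathsf{Chisq}(n_i - 1)$ independently, and independent chi-squared variables add their degrees of freedom:
$$ \frac{\nu S_p^2}{\sigma^2} = \frac{(n_1-1)S_1^2}{\sigma^2} + \frac{(n_2-1)S_2^2}{\sigma^2} \sim \mathsf{Chisq}(n_1 - 1 + n_2 - 1) = \mathsf{Chisq}(\nu). $$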
For $S_a^2,$ the distribution theory is not so clear.
You say something about $S_a^2$ being unbiased, but that
hardly specifies a distribution. Let's use the same
degrees of freedom $\nu$ as above for an experiment.
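Unbiasedness itself is easy to verify, using only $E[S_i^2] = \sigma^2$ (this check is ours, not part of the original argument):
$$ E[S_a^2] = \frac{n_1 E[S_1^2] + n_2 E[S_2^2]}{n_1+n_2} = \frac{(n_1+n_2)\sigma^2}{n_1+n_2} = \sigma^2. $$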
Simulation: Begin by looking at $m = 10^5$ samples
of size $n_1 = 2$ from $\mathsf{Norm}(\mu_1 = 100, \sigma_1 = 15)$
and $m$ samples of size $n_2=3$ from $\mathsf{Norm}(\mu_2 = 110, \sigma_2 = 15).$
We find the sample variances, the pooled variance estimate,
and the average variance estimate. Then we look at the
corresponding chi-squared random variables.
set.seed(2022)
n1 = 2; m = 10^5
M1 = matrix(rnorm(n1*m, 100, 15), nrow=m)  # each row is a sample of size n1
v1 = apply(M1, 1, var)                     # m sample variances S_1^2
n2 = 3
M2 = matrix(rnorm(n2*m, 110, 15), nrow=m)  # each row is a sample of size n2
v2 = apply(M2, 1, var)                     # m sample variances S_2^2
pool = ((n1-1)*v1 + (n2-1)*v2)/(n1+n2-2)   # pooled estimate S_p^2
q.p = (n1+n2-2)*pool/15^2                  # nu * S_p^2 / sigma^2
avg.v = (n1*v1 + n2*v2)/(n1+n2)            # average estimate S_a^2
q.a = (n1+n2)*avg.v/15^2                   # (n1+n2) * S_a^2 / sigma^2
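Before plotting, a quick moment check (our addition, not in the original) already hints at the problem: a $\mathsf{Chisq}(\nu)$ random variable has mean $\nu$ and variance $2\nu,$ but q.a has mean $n_1+n_2 = 5.$
nu = n1 + n2 - 2
c(mean(q.p), var(q.p))  # close to nu = 3 and 2*nu = 6, consistent with Chisq(3)
c(mean(q.a), var(q.a))  # mean close to 5, so q.a cannot be Chisq(3)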
Then we compare the results with the density function
of the chi-squared distribution with $\nu = n_1+n_2-2$ degrees of freedom.
For the pooled estimate $S_p^2$ we get a good match,
but for $S_a^2$ the fit is not good.
R code for graphs:
par(mfrow=c(1,2))
hist(q.p, prob=T, ylim=c(0,.35), col="skyblue2", main="Pooled")
curve(dchisq(x, n1+n2-2), add=T, lwd=2, col="orange")
hist(q.a, prob=T, ylim=c(0,.35), col="skyblue2", main="Averaged")
curve(dchisq(x, n1+n2-2), add=T, lwd=2, col="orange")
par(mfrow=c(1,1))
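For a more formal comparison (our addition), a Kolmogorov-Smirnov test against $\mathsf{Chisq}(\nu)$ tells the same story:
ks.test(q.p, "pchisq", df = n1+n2-2)  # typically does not reject: q.p really is Chisq(3)
ks.test(q.a, "pchisq", df = n1+n2-2)  # p-value essentially 0: q.a is not Chisq(3)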
The simple answer is that the $t$ statistic does not depend on $\mu$ at all, and this is much easier to see from the original, non-transformed formula:
$$t = \frac{\overline x - \overline y}{\sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{\nu}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.$$
Indeed, under the transformations $x_i \mapsto x_i - \mu$ and $y_i \mapsto y_i - \mu$, we have $\overline x \mapsto \overline x - \mu$ and $\overline y \mapsto \overline y - \mu$, and also $s_1^2 = \frac{1}{n_1} \sum (x_i - \overline x)^2 \mapsto s_1^2$ and similarly $s_2^2 \mapsto s_2^2$, so the variable $t$ is invariant under horizontal shifts of the parent distribution. This is why we can assume $\mu = 0$ without loss of generality.
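To see the invariance numerically, here is a small sketch (the function name t.stat and the sample sizes are ours); note it uses the divide-by-$n$ variances from the formula above, not R's var():
t.stat <- function(x, y) {
  n1 <- length(x); n2 <- length(y); nu <- n1 + n2 - 2
  s1sq <- mean((x - mean(x))^2)  # divide-by-n variance, as in the formula
  s2sq <- mean((y - mean(y))^2)
  (mean(x) - mean(y)) / (sqrt((n1*s1sq + n2*s2sq)/nu) * sqrt(1/n1 + 1/n2))
}
x <- rnorm(5, 100, 15); y <- rnorm(7, 100, 15)
t.stat(x, y)              # some value
t.stat(x - 100, y - 100)  # exactly the same value: the shift cancels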
TL;DR: The difference between the two situations is whether you use $\sqrt{\frac{s_a^2}{n_a} + \frac{s_b^2}{n_b}}$ or $\sqrt{\frac{s^2}{n_a} + \frac{s^2}{n_b}}$.
In the second case the estimate of the variance of the populations $a$ and $b$ is coupled based on the assumption that the populations have equal variance.
The nasty-looking formula in the second case stems from deriving the pooled sample deviation $s$ from the individual sample deviations $s_a$ and $s_b$.
The formulas might become more intuitive when you consider the sum of squared residuals from which the sample standard deviation is derived:
$$ \sum_{i=1}^n {r_i^2} = \sum_{i=1}^n (x_i - \bar{x})^2$$
This is a sum of $n$ terms, but it is effectively equivalent to a sum of $n-1$ squared independent normally distributed variables with variance $\sigma^2$. (See for instance Why are the residuals in $\mathbb{R}^{n-p}$?)
A sum of $k$ squared independent normally distributed variables with standard deviation $\sigma$ follows a gamma distribution* with shape parameter $k/2$ and scale parameter $2\sigma^2$; here $k = n-1$, so the sum of squared residuals has mean $(n-1)\sigma^2$. So if we divide by $n-1$ then we have an unbiased estimate of the variance.
$$s^2 = \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n {r_i^2}$$
And $s$ is the corrected sample estimate of the standard deviation
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n {r_i^2}}$$
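A quick simulation check of this mean, and of the fact that R's var() already applies the $n-1$ correction (our own illustration; the values of sigma and n are arbitrary):
# For a Norm(0, sigma) sample of size n, the sum of squared residuals has
# mean (n-1)*sigma^2, so dividing by n-1 gives an unbiased variance estimate.
sigma <- 2; n <- 6
ss <- replicate(10^4, {x <- rnorm(n, 0, sigma); sum((x - mean(x))^2)})
mean(ss)          # close to (n-1)*sigma^2 = 20
mean(ss/(n - 1))  # close to sigma^2 = 4; this is what var() computes per sample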
Now suppose we have more residual terms because we sampled two populations that we assume have the same variance (that is what the pooling does). Then we simply sum those residual terms, and the result is a gamma-distributed variable equivalent to a sum of $(n_a - 1)+(n_b - 1)$ squared normally distributed variables.
$$s = \sqrt{\frac{1}{(n_a-1)+(n_b-1)} \left(\sum_{i=1}^{n_a} {r_{a,i}^2} + \sum_{i=1}^{n_b} {r_{b,i}^2}\right)}$$
where $n_a$ and $n_b$ are the sizes of the two samples and $r_{a,i}$ and $r_{b,i}$ are the residual terms in the two samples.
If instead of the sums of squared residuals you use the corrected sample variances $$\sum_{i=1}^{n_a} {r_{a,i}^2} = s_a^2 (n_a -1), \qquad \sum_{i=1}^{n_b} {r_{b,i}^2} = s_b^2 (n_b -1),$$
then you get
$$s = \sqrt{\frac{s_a^2 (n_a -1) +s_b^2 (n_b -1)}{(n_a-1)+(n_b-1)}}$$
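As a concrete sketch (the function name pooled.sd is ours, not standard R):
# Pooled standard deviation from two samples, assuming equal population variance
pooled.sd <- function(x, y) {
  na <- length(x); nb <- length(y)
  ra2 <- sum((x - mean(x))^2)  # sum of squared residuals, sample a
  rb2 <- sum((y - mean(y))^2)  # sum of squared residuals, sample b
  sqrt((ra2 + rb2) / ((na - 1) + (nb - 1)))
}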
The additional factor $\sqrt{\frac{1}{n_a}+\frac{1}{n_b}}$ converts an estimate of the variance/deviation of the population into the variance/deviation of the sample mean, or of the difference between two sample means. The estimate of the variance of one mean will be $s^2/n_a$ and of the other $s^2/n_b$; the estimate for the variance of the difference is the sum of these two.
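Putting the pieces together (continuing the sketch above, with x and y two numeric samples), the resulting statistic matches R's built-in pooled test:
x <- rnorm(8, 100, 15); y <- rnorm(10, 110, 15)
se <- pooled.sd(x, y) * sqrt(1/length(x) + 1/length(y))  # SE of the difference in means
(mean(x) - mean(y))/se  # equals t.test(x, y, var.equal = TRUE)$statistic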
*Some readers might be more familiar with the $\chi^2$ distribution, which is the special case of this gamma distribution with $\sigma = 1$: a $\chi^2$ variable with $k$ degrees of freedom is a gamma variable with shape $k/2$ and scale $2$.