Equal vs Unequal Variance T-tests: Clarification and Differences


In unequal variance t-test (Welch t-test):

$$H_0: \text{No difference in population means (variances may differ)}$$
$$H_1: \text{The population means differ}$$

I don't see the point of the unequal variance test. If the sample means are the same but the variances differ, what does that tell us?

Please address this question with the following case studies.

Case 1: Two different medical procedures were applied to the same group of patients. How do we test whether the two procedures differ significantly from each other?

Case 2: One class taught by the same teacher is split into two groups, and both groups take an exam. The supervisor who has the exam results doesn't know about the split. He wants to know whether the two groups (samples) came from the same classroom (population). What does the unequal variance test do here?

I also read that the F-test is used to test for a difference in variances. How does the F-test relate to the unequal or equal variance t-test?

Best Answer

The alternative to the Welch 2-sample t test is the pooled 2-sample t test. For the pooled test to give reliable results, the population variances must be equal. The Welch test, by contrast, works well whether or not the variances are equal.

Pooled t test. If I have a sample of size 10 from $\mathsf{Norm}(\mu = 50, \sigma=8)$ and a sample of size 30 from $\mathsf{Norm}(\mu = 50, \sigma=8),$ then the pooled 2-sample t test (with a critical value chosen for level $\alpha = 0.05$) has probability 5% of rejecting $H_0: \mu_1 = \mu_2$ in favor of $H_a: \mu_1 \ne \mu_2.$ This is as it should be for a test at the 5% level of significance.

set.seed(615)  # means equal, variances equal
pv = replicate(10^5, t.test(rnorm(10,50,8), rnorm(30,50,8), var.eq=T)$p.val )
mean(pv < .05)
[1] 0.0501     # as it should be

However, if I have a sample of size 10 from $\mathsf{Norm}(\mu = 50, \sigma=8)$ and a sample of size 30 from $\mathsf{Norm}(\mu = 60, \sigma=8),$ then the pooled 2-sample t test has a high probability of rejecting $H_0: \mu_1 = \mu_2$ in favor of $H_a: \mu_1 \ne \mu_2.$ In the simulation below we see that this probability, called the 'power' of the test, is about 92%.

set.seed(616)  # means unequal, variances equal
pv = replicate(10^5, t.test(rnorm(10,50,8), rnorm(30,60,8), var.eq=T)$p.val )
mean(pv < .05)
[1] 0.91576    # very good power

So the pooled t test works well when variances are known to be equal.

But what happens if the means are equal and the variances are unequal with $\sigma_1 = 10$ in the first population and with $\sigma_2 = 5$ in the second population?

Then what ought to be a test at the 5% level has become a test at about the 15% level. So about 15% of the time I'll falsely conclude that the means are unequal when they really are equal. As a result, I might publish some false "discoveries."

set.seed(617)  # means equal, variances unequal
pv = replicate(10^5, t.test(rnorm(10,50,10), rnorm(30,50,5), var.eq=T)$p.val )
mean(pv < .05)
[1] 0.15408    # excessively high probability of Type I error
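
A rough calculation suggests why the level is inflated in this setup. It is only a sketch: it plugs the population variances in for their sample estimates, so it describes what happens on average rather than giving an exact derivation. The symbols $n_i,$ $S_i^2,$ $\bar X_i,$ and $S_p^2$ (sample sizes, sample variances, sample means, and pooled variance) are notation introduced here. The pooled variance is a weighted average dominated by the larger sample:

$$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}, \qquad E(S_p^2) = \frac{9(100) + 29(25)}{38} \approx 42.8.$$

The pooled test therefore treats the variance of $\bar X_1 - \bar X_2$ as roughly

$$42.8\left(\frac{1}{10} + \frac{1}{30}\right) \approx 5.7, \qquad\text{whereas the true value is}\qquad \frac{100}{10} + \frac{25}{30} \approx 10.8.$$

The denominator of the pooled t statistic is thus too small by a factor of about $\sqrt{10.8/5.7} \approx 1.4,$ which makes large values of $|T|$ more common than the t distribution with 38 degrees of freedom allows. The inflation arises because the smaller sample comes from the more variable population.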

Welch t test. By contrast, the Welch test uses a modified t statistic, usually with fewer degrees of freedom, in order to get a test close to the 5% level. [Note that in the R procedure t.test, removing the argument var.eq=T changes the procedure from a pooled to a Welch test. Welch is the default, and var.eq is a partial match for the full argument name var.equal.]

set.seed(618)  # Welch with means equal, variances unequal
pv = replicate(10^5, t.test(rnorm(10,50,10), rnorm(30,50,5))$p.val )
mean(pv < .05)
[1] 0.05169    # as it should be
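
For reference, the modified statistic and its approximate degrees of freedom (the Welch-Satterthwaite approximation) can be written as follows. Here $\bar X_i,$ $S_i^2,$ and $n_i$ denote the sample means, sample variances, and sample sizes; this notation is introduced for illustration and is not used elsewhere in the answer.

$$T = \frac{\bar X_1 - \bar X_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}, \qquad \nu \approx \frac{\left(S_1^2/n_1 + S_2^2/n_2\right)^2}{\dfrac{(S_1^2/n_1)^2}{n_1-1} + \dfrac{(S_2^2/n_2)^2}{n_2-1}}.$$

As a rough guide only, plugging the population values $\sigma_1 = 10,\ \sigma_2 = 5,\ n_1 = 10,\ n_2 = 30$ in for the sample quantities gives

$$\nu \approx \frac{(10 + 0.833)^2}{\dfrac{10^2}{9} + \dfrac{0.833^2}{29}} \approx 10.5,$$

compared with $n_1 + n_2 - 2 = 38$ for the pooled test. In an actual test the degrees of freedom are computed from the sample variances, so they vary from one pair of samples to the next.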

Moreover, the Welch test still does a pretty good job of detecting when means are unequal: it has power about 79%.

set.seed(619)  # Welch with means unequal, variances unequal
pv = replicate(10^5, t.test(rnorm(10,50,10), rnorm(30,60,5))$p.val )
mean(pv < .05)
[1] 0.78657    # reasonably good power
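
To make the Welch procedure concrete for a single pair of samples, here is a minimal sketch that computes the statistic, the approximate degrees of freedom, and the two-sided P-value directly, and then compares them with the output of t.test. The helper name welch_by_hand and the seed are arbitrary choices for illustration; they are not part of base R or of the simulations above.

```r
# Welch two-sample t test computed by hand, for comparison with t.test().
# welch_by_hand is a hypothetical helper name, not a base R function.
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)   # estimated variance of mean(x)
  vy <- var(y) / length(y)   # estimated variance of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_val <- 2 * pt(-abs(t_stat), df)   # two-sided P-value
  c(t = t_stat, df = df, p = p_val)
}

set.seed(2024)               # arbitrary seed, for reproducibility only
x <- rnorm(10, 50, 10)       # small sample, larger SD
y <- rnorm(30, 50, 5)        # larger sample, smaller SD

by_hand <- welch_by_hand(x, y)
builtin <- t.test(x, y)      # Welch is the default (var.equal = FALSE)

print(by_hand)
print(c(t  = unname(builtin$statistic),
        df = unname(builtin$parameter),
        p  = builtin$p.value))
```

The two printed vectors should agree up to floating-point error. The degrees of freedom are typically not an integer and, in this setup, tend to be far below the 38 used by the pooled test.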

What's the point? The point of using the Welch test is that it performs well even if the population variances are not equal. In practice, one usually doesn't know whether the population variances are equal. So good statistical practice is to use the Welch version of the two-sample t test unless one has reliable prior evidence that the population variances are equal.

Note: The F-test for unequal variances has poor power. It should not be used to 'screen' whether to use the pooled or the Welch test. If there is any uncertainty about whether the variances are equal, use the Welch test by default.