Solved – Always use Welch-t test (unequal variances t-test) instead of Student-t or Mann-Whitney test

Tags: hypothesis-testing, t-test, wilcoxon-mann-whitney-test

I want to run an A/B test to check whether one version significantly increases revenue. More generally, I want to test whether the central tendency (mean) of 2 groups differs, on the basis of (unpaired) samples from the 2 groups.

My understanding is that I could use the following approaches:

  • Student-t test, if the variances of both groups are the same and both groups are normally distributed
  • Welch-t test (unequal variance t-test), if the variances of the groups might not be the same but both groups are still normally distributed
  • Mann–Whitney U test (Wilcoxon rank-sum test), if I cannot make any assumptions about the distribution of either group
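For concreteness, here is a minimal sketch of the three test statistics using only the Python standard library. The function names and the revenue figures are made up for illustration, and the sketch stops at the statistics and degrees of freedom; in practice a library routine such as SciPy's `scipy.stats.ttest_ind` (with `equal_var=False` for Welch) or `scipy.stats.mannwhitneyu` would also supply p-values.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Student's t statistic and df, using a pooled variance (assumes equal variances)."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite df (no equal-variance assumption)."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a exceeds b count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Hypothetical per-user revenue for the two versions (illustrative numbers only).
control = [10.0, 12.5, 9.0, 11.0, 13.5, 10.5]
variant = [14.0, 9.5, 22.0, 12.0, 30.0, 11.5, 16.0]

print("Student:", student_t(control, variant))
print("Welch:  ", welch_t(control, variant))
print("U:      ", mann_whitney_u(control, variant))
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances both can differ.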

But now I read (https://academic.oup.com/beheco/article/17/4/688/215960/The-unequal-variance-t-test-is-an-underused) that I can always use the Welch-t test. The article argues that it is dangerous to make the assumption of equal variance and additionally that

the unequal variance t-test performs as well as, or better than, the Student's t-test in terms of control of both Type I and Type II error rates whenever the underlying distributions are normal.

And if the groups are not normally distributed, then I can just rank the data beforehand:

Thus, Zimmerman and Zumbo (1993) suggest that the unequal variance t-test can effectively replace the Mann–Whitney U test if the data are first ranked before the test is applied.

So the article's final conclusion is:

If you want to compare the central tendency of 2 populations based on samples of unrelated data, then the unequal variance t-test should always be used in preference to the Student's t-test or Mann–Whitney U test. To use this test, first examine the distributions of the 2 samples graphically. If there is evidence of nonnormality in either or both distributions, then rank the data. Take the ranked or unranked data and perform an unequal variance t-test.
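The rank-then-Welch procedure described in the quote could be sketched roughly as below. This is an illustration, not the article's code: it assumes the ranks are assigned over the combined sample (as in the Mann-Whitney test) with tied values sharing their average rank, and the helper names are invented here. It returns the statistic and degrees of freedom only; a p-value would come from the t distribution with that df.

```python
import math
from statistics import mean, variance


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def welch_on_ranks(a, b):
    """Rank the pooled data, split the ranks back by group, then apply Welch's test."""
    ranks = average_ranks(list(a) + list(b))
    return welch_t(ranks[:len(a)], ranks[len(a):])
```

Because only the ordering of the pooled data enters the calculation, the result is unchanged by any strictly increasing transformation of the data (for example, taking logs of revenue).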

So my question is:

Do you see any drawbacks in always using the Welch-t test instead of the Student-t test or Mann-Whitney test?

Best Answer

There have been a number of papers which examine this issue. Most of them come to the conclusion that Welch's version of the t-test can be safely used in most circumstances.

The only situation in which the test seems to perform poorly is with very small sample sizes.

Here are some quotes from two papers which examine t-test performance with small sample sizes:

The t-test with the unequal variances option (i.e., the Welch test) was generally not preferred either. Only in the case of unequal variances combined with unequal sample sizes, where the small sample was drawn from the small variance population, did this approach provide a power advantage compared to the regular t-test. In the other cases, a substantial amount of statistical power was lost compared to the regular t-test. The power loss of the Welch test can be explained by its lower degrees of freedom determined from the Welch-Satterthwaite equation.$^1$
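For reference, the Welch-Satterthwaite approximation mentioned in that quote is usually written as follows, where $s_i^2$ is the sample variance and $n_i$ the size of group $i$ (this notation is assumed here, not taken from the paper):

$$
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

The result always lies between $\min(n_1, n_2) - 1$ and $n_1 + n_2 - 2$, so it never exceeds the degrees of freedom used by the regular t-test.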

Results suggest that the Welch t test is indeed inflated, according to Bradley's (1978) fairly stringent criterion, when sample sizes are unequal – even when assumptions for the t test are met in the population. The inflation rate seems to be dependent more on the size of the smaller group than on the total sample size, but sample size ratio does seem to play a small role$^2$

If you read through those papers, though, you'll see that the problem is largely confined to very small sample sizes, in particular when the smaller of the two groups is very small. Both papers suggest that the effects are really only troublesome when a group contains around 5 subjects or fewer, but take a closer look at the references for a more thorough discussion. In that case, the obvious suggestion is to collect more data, although that may not be feasible when experiments are prohibitively expensive.

Otherwise, Welch's test is probably fine.
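To get a feel for why the size of the smaller group matters, here is a small sketch that computes the Welch-Satterthwaite degrees of freedom from summary statistics. The function name and the chosen group sizes are illustrative assumptions; both groups are given the same variance, so the regular t-test's assumptions hold.

```python
def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite degrees of freedom from sample variances and group sizes."""
    v1, v2 = var1 / n1, var2 / n2  # squared standard errors of the two means
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Equal variances (1.0 in both groups), larger group fixed at 50 (hypothetical sizes).
n_large = 50
for n_small in (3, 5, 10, 25, 50):
    student_df = n_small + n_large - 2
    print(n_small, n_large, student_df, round(welch_df(1.0, n_small, 1.0, n_large), 1))
```

When one group is much smaller than the other, the Welch degrees of freedom stay close to the smaller group's size minus one, even though the regular t-test would use the full $n_1 + n_2 - 2$.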

$^1$ : Using the Student’s t-test with extremely small sample sizes, J.C.F. de Winter 2013

$^2$ : Type I Error Inflation of the Separate-Variances Welch t test with Very Small Sample Sizes when Assumptions Are Met, Albert K. Adusah and Gordon P. Brooks 2011