T-Test – Choosing Between T-Test and Wilcoxon Rank Sum Test for Statistical Analysis

t-test, wilcoxon-mann-whitney-test

For an experiment, I analyzed behavioral data from two treatment groups with $n = 30$ per group, so 60 subjects in total. I used a computational model to analyze the data and calculate a specific score that describes the motivation to obtain a specific reward.

This score is a continuous variable, but according to a Shapiro-Wilk test, the data are not normally distributed. A two-sided Wilcoxon rank sum test finds no significant difference between the two groups ($p \approx .08$).
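For reference, such a per-group normality check might look like this in R; `score` and `group` are hypothetical stand-ins for the actual variables, filled here with simulated right-skewed data:

```r
# Hypothetical stand-ins for the real data: two groups of n = 30
# with right-skewed (exponential) scores.
set.seed(42)
score <- c(rexp(30, 1/10), rexp(30, 1/15))
group <- rep(c("A", "B"), each = 30)

# Shapiro-Wilk test within each group; testing the pooled sample
# instead would confound non-normality with a group difference.
by(score, group, shapiro.test)
```

Note that normality should be assessed within each group (or on model residuals), not on the pooled sample.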
However, after performing an alternative (model-free) analysis by using logistic regression, I actually found a significant effect of treatment group on behavior.

Additionally, after performing a two-sided t-test on the model-derived scores, I also found a significant difference between the two groups ($p \approx .04$).

The information I have found is quite contradictory. Some sources say that a sample size of 30 is sufficient to use a t-test, even if the data are not normally distributed. Others say that I should still use a U-test, even for "larger" sample sizes.

So I'm a bit confused. Of course, the results of the t-test fit with my hypothesis and the model-free findings. On the other hand, I want to use the "appropriate" test, not the one that confirms my hypothesis.

What do you think? Is my sample size large enough to use a t-test? Or should I go with the U-test, even though it contradicts the results of my alternative method?

EDIT: This is what the boxplot of my data looks like (adapted to the code below):

[Boxplot]

Best Answer

You should not expect an accurate result from a two-sample t test on samples that are far enough from normal to fail Shapiro-Wilk tests of normality. A P-value of about 4% would be only barely significant even if it were accurate.

If the two samples have approximately the same shape, a Wilcoxon rank sum test can tell you whether the population medians differ significantly. However, this test is not quite as powerful as a t test. In any case, a P-value of about 8% is not impressive evidence of a difference between population locations.
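As an aside, under that equal-shape (shift model) assumption, R's `wilcox.test` can also report the Hodges-Lehmann estimate of the location shift together with a confidence interval via `conf.int = TRUE`; a sketch using the same fictitious exponential data generated later in this answer:

```r
# Same fictitious samples as used later in this answer.
set.seed(1234)
x1 <- rexp(30, 1/10);  x2 <- rexp(30, 1/15)

# conf.int = TRUE adds the Hodges-Lehmann estimate of the location
# difference and a confidence interval (valid under the shift model).
wilcox.test(x1, x2, conf.int = TRUE)
```

The P-value is the same as the one reported by the plain `wilcox.test` call below; `conf.int` only adds the location-shift estimate.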

@Dave has a good point that you have done too many tests on the data. Cherry picking the smallest P-value of two 2-sample tests would be "P-hacking."

Consider the following fictitious data:

set.seed(1234)
x1 = rexp(30, 1/10);  x2 = rexp(30, 1/15)
mean(x1); mean(x2)
[1] 9.384906
[1] 17.75834

Means are quite different. The issue is whether the difference is statistically significant at, say, the 5% level. Boxplots show strongly right-skewed samples and apparently different dispersions.

x = c(x1,x2);  g = rep(1:2, each=30)
boxplot(x~g, horizontal=T, col="skyblue2")

[Boxplots of the two simulated samples]

Normal probability plots are clearly not linear, so the data should not be assumed normal. The Welch t test may or may not give useful results with sample sizes as large as $n_1=n_2 = 30.$

[Normal probability (QQ) plots of the two samples]

R code for figure:

par(mfrow=c(1,2))
 qqnorm(x1); qqline(x1, col="blue")
 qqnorm(x2); qqline(x2, col="blue")
par(mfrow=c(1,1))

My first (and only) test would be a permutation test using the Welch t statistic as the metric. This test does not assume that the data are normal, nor that the t statistic has Student's t distribution; it approximates the permutation distribution of the statistic for our data. [We use P-values as the metric here because the Welch t test tends to have slightly different degrees of freedom at each permutation.]

pv.obs = t.test(x~g)$p.val; pv.obs
[1] 0.02797518
pv = replicate(10^5, t.test(x~sample(g))$p.val)
mean(pv <= pv.obs)
[1] 0.02633  # Sim. P-value of permutation test

So the permutation test finds a significant difference at the 3% level.

Because the boxplots show different shapes (dispersions), I would stop there.

If you want to know what the pooled t test and the Wilcoxon rank sum test would have given, here are the results. But we have already done a valid test, so these results are to satisfy curiosity, not to serve as valid test results.

t.test(x1,x2, var.eq=T)$p.val # Pooled
[1] 0.02622975
wilcox.test(x1,x2)$p.val      # Wilcoxon rank sum
[1] 0.2358858

Note: My fictitious right-skewed data for this Answer were sampled from exponential populations. If you know that the data are exponential, then there is an exact test. See this Q&A, where it is stated that for the means of two independent exponential samples of size $n$ each, $\frac{\bar X_1}{\bar X_2} \sim \mathsf{F}(2n,2n)$ under the null hypothesis of equal means. So, for our data with $\bar X_2 > \bar X_1,$ the P-value of an exact two-sided test is $0.015.$

f = mean(x2)/mean(x1); f
[1] 1.892223
2*(1 - pf(f, 60, 60))
[1] 0.01470998