So, for an experiment I analyzed behavioral data from two different treatment groups, with $n = 30$ in each group (60 subjects in total). I used a computational model to analyze the data and compute a score that quantifies the motivation to obtain a specific reward.
This score is a continuous variable, but according to a Shapiro-Wilk test its distribution is non-normal. A two-sided Wilcoxon rank sum test finds no significant difference between the groups ($p \approx .08$).
However, after performing an alternative (model-free) analysis by using logistic regression, I actually found a significant effect of treatment group on behavior.
Additionally, a two-sided t-test on the model-derived scores does find a significant difference between the groups ($p \approx .04$).
Now the information I have is quite contradictory. Some sources say that a sample size of 30 is sufficient for a t-test even if the data are not normally distributed; others say I should still use a U-test, even for "larger" sample sizes.
So I'm a bit confused. Of course, the results of the t-test fit with my hypothesis and the model-free findings. On the other hand, I want to use the "appropriate" test, not the one that happens to confirm my hypothesis.
What do you think? Is my sample size big enough to use a t-test? Or should I go with the u-test, even though it contradicts the results of my alternative method?
EDIT: This is what the boxplot of my data looks like (adapted to the code below).
Best Answer
You should not expect an accurate result from a two-sample t test on samples that are far enough from normal to fail a Shapiro-Wilk test of normality. Moreover, a P-value of about 4% would be only barely significant even if it were accurate.
If the two samples have approximately the same shape, a Wilcoxon rank sum test might tell you whether the population medians are significantly different. However, this test is not quite as powerful as a t test. In any case, a P-value of about 8% is not impressive evidence of a difference between population locations.
@Dave has a good point that you have done too many tests on these data. Cherry-picking the smallest P-value of two two-sample tests would be "P-hacking."
Consider the following fictitious data:
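The original data listing did not survive, so here is a sketch consistent with the note at the end that the samples were drawn from exponential populations; the seed and rate parameters are my own illustrative choices, not the original values.

```r
# Illustrative fictitious data: two right-skewed exponential samples of size 30.
# Rates are hypothetical; the original answer's exact values are unknown.
set.seed(2021)
x1 <- rexp(30, rate = 1/10)   # group 1, population mean 10
x2 <- rexp(30, rate = 1/15)   # group 2, population mean 15
summary(x1)
summary(x2)
```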
Means are quite different; the question is whether the difference is statistically significant at, say, the 5% level. Boxplots show strongly right-skewed samples with apparently different dispersions.
Normal probability plots are clearly not linear, so the data should not be assumed normal. The Welch t test may or may not give useful results with sample sizes as large as $n_1=n_2 = 30.$
R code for figure:
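The original figure code was not preserved. A minimal sketch that would produce boxplots and normal probability plots like those described above (data regenerated here with hypothetical parameters so the block is self-contained):

```r
# Regenerate the illustrative exponential samples (parameters are assumptions)
set.seed(2021)
x1 <- rexp(30, rate = 1/10)
x2 <- rexp(30, rate = 1/15)

# 2 x 2 panel: boxplots (top) and normal Q-Q plots (bottom)
par(mfrow = c(2, 2))
boxplot(x1, horizontal = TRUE, col = "skyblue2", main = "Sample 1")
boxplot(x2, horizontal = TRUE, col = "skyblue2", main = "Sample 2")
qqnorm(x1, main = "Normal Q-Q: Sample 1"); qqline(x1)
qqnorm(x2, main = "Normal Q-Q: Sample 2"); qqline(x2)
par(mfrow = c(1, 1))
```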
My first (and only) test would be a permutation test using the Welch t statistic as the metric. This test assumes neither that the data are normal nor that the statistic has a t distribution; it approximates the permutation distribution of the statistic for these data. [We compare P-values rather than t statistics here because the Welch t test can have slightly different degrees of freedom at each iteration.]
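One way such a permutation test might be coded (data regenerated with my assumed parameters; the resulting P-value will differ from the 3% reported for the original fictitious data):

```r
# Permutation test using the Welch t test's P-value as the metric.
# Data are illustrative exponential samples; parameters are assumptions.
set.seed(2021)
x1 <- rexp(30, rate = 1/10)
x2 <- rexp(30, rate = 1/15)

all <- c(x1, x2)
pv.obs <- t.test(x1, x2)$p.value      # Welch test is R's default

B <- 10000
pv.prm <- replicate(B, {
  idx <- sample(60, 30)               # random re-labeling of the 60 values
  t.test(all[idx], all[-idx])$p.value
})

mean(pv.prm <= pv.obs)                # permutation P-value
```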
So the permutation test finds a significant difference at the 3% level.
Because the boxplots show different shapes (dispersions), I would stop there.
If you want to know what the pooled t test and the Wilcoxon rank sum test would have given, here are the results. But we have already done a valid test, so these are to satisfy curiosity, not additional valid test results.
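For completeness, these two follow-up tests might be run as follows (again on regenerated illustrative data, so the P-values are not the original ones):

```r
# Illustrative data (assumed parameters)
set.seed(2021)
x1 <- rexp(30, rate = 1/10)
x2 <- rexp(30, rate = 1/15)

t.test(x1, x2, var.equal = TRUE)$p.value   # pooled two-sample t test
wilcox.test(x1, x2)$p.value                # Wilcoxon rank sum test
```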
Note: My fictitious right-skewed data for this Answer were sampled from exponential populations. If you know that the data are exponential, then there is an exact test. See this Q&A, where it is shown that the means of two independent exponential samples, each of size $n,$ satisfy $\frac{\bar X_1}{\bar X_2} \sim \mathsf{F}(2n,2n)$ under the null hypothesis of equal means. So, for our data with $\bar X_2 > \bar X_1,$ the P-value of an exact two-sided test is $0.015.$
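The exact test can be sketched in a few lines using the F distribution (illustrative data again; the original answer's $0.015$ was computed from its own samples):

```r
# Exact two-sided test for equal means of two exponential samples of size n:
# under H0, mean(x1)/mean(x2) ~ F(2n, 2n). Data and parameters are assumptions.
set.seed(2021)
n  <- 30
x1 <- rexp(n, rate = 1/10)
x2 <- rexp(n, rate = 1/15)

r <- mean(x1) / mean(x2)
2 * min(pf(r, 2*n, 2*n), 1 - pf(r, 2*n, 2*n))   # two-sided P-value
```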