Solved – Mann Whitney or two tailed t-test

likert, sample-size, t-test, wilcoxon-mann-whitney-test

I am new to statistics and am in need of some help. I am running a study where I am providing two groups with surveys to test for their perceptions using a Likert Scale. One group I'm testing prior to training, the other after several years of experience. Due to the nature of the position, the sample size is pretty small. I've got around 25 in the pre-training group, and around 35 in the experienced group. I don't really have many options as far as obtaining a larger sample size.

Anyway, due to the small amount of data, I've seen some recommendations to go with the Mann Whitney test to compensate for such a small sample size. Can anyone help me out with advice? Thanks!

Here's a sample of one data point if it makes a difference:

GROUP A: 10 8 10 7 10 7 6 10 7 10 5 4 5 10 10 10 2 7 4 9 8 10 10 

GROUP B: 10 10 8 9 10 10 9 10 10 10 10 10 8 8 7 8 10 8 10 10 10 8 9 10 8 10 10 10 10 5 10

Best Answer

The Mann-Whitney doesn't really 'compensate for small sample size'.

And in any case, 25 and 35 for a simple comparison of means is getting up toward middling.

(It's more that at small sample sizes you may not be able to assess the assumptions of the t-test very well, while at large sample sizes the central limit theorem (plus Slutsky's theorem) should give you approximate normality of the t-statistic even if some of the assumptions are violated, making non-normality and even heteroskedasticity less of an issue, though some other assumptions remain important.)

Unless your scores (within each group) are pretty skew or very heavy tailed or very concentrated in one or two categories, the t-test should not be too badly affected (that is, it's moderately robust against mild skew or moderate heavy-tails, and can outperform the M-W on light-tailed distributions).

If you're in doubt, by all means use the Mann-Whitney; it has good power properties at the normal and if there's a tendency toward skewness or heavy-tailedness will tend to outperform the $t$ on shift-alternatives, possibly quite heavily. And people 'recognize it', which is sometimes useful (in terms of requiring less explanation or justification).

If you do use the $t$-test, consider some assessment of the assumptions:
- I'd recommend a Welch-type adjustment for unequal variances, but if you don't do that, at least check how bad it is - an order of magnitude different might be a concern;
- consider doing a Q-Q plot of residuals to assess how bad the non-normality is.

(At a combined sample size of 60, you should be able to detect severe assumption failure.)
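As a rough sketch of the first check (not part of the original analysis), the following Python computes the ratio of sample variances and the Welch statistic with its Welch-Satterthwaite degrees of freedom for the item posted in the question. The function name `welch_t` is made up for illustration, and only the standard library is assumed.

```python
import math
import statistics

# Data for the single item posted in the question.
group_a = [10, 8, 10, 7, 10, 7, 6, 10, 7, 10, 5, 4, 5, 10, 10, 10,
           2, 7, 4, 9, 8, 10, 10]
group_b = [10, 10, 8, 9, 10, 10, 9, 10, 10, 10, 10, 10, 8, 8, 7, 8,
           10, 8, 10, 10, 10, 8, 9, 10, 8, 10, 10, 10, 10, 5, 10]


def welch_t(x, y):
    """Return the Welch t-statistic and Welch-Satterthwaite degrees of freedom."""
    # statistics.variance uses the n - 1 denominator (sample variance).
    sx = statistics.variance(x) / len(x)
    sy = statistics.variance(y) / len(y)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sx + sy)
    df = (sx + sy) ** 2 / (sx ** 2 / (len(x) - 1) + sy ** 2 / (len(y) - 1))
    return t, df


# A ratio near 1 suggests similar spread; a ratio near 10 would be a concern.
variance_ratio = statistics.variance(group_a) / statistics.variance(group_b)
t_stat, df = welch_t(group_a, group_b)

print(f"variance ratio (A/B): {variance_ratio:.2f}")
print(f"Welch t = {t_stat:.2f} on about {df:.1f} degrees of freedom")
```

The p-value would then come from a t distribution with `df` degrees of freedom; that step is left out here because the standard library has no t distribution.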

Another alternative is to consider a test based on resampling, such as a permutation test (you could even use the t-statistic if you wish, though its numerator would be more traditional) -- though with your sample sizes you might not be able to consider every arrangement into samples of 25 and 35 (there are about 52 quadrillion of them), so perhaps a randomization or bootstrap test.
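The count of arrangements quoted above can be checked directly; a one-line sketch in Python, assuming nothing beyond the standard library:

```python
import math

# Number of distinct ways to split 60 observations into groups of 25 and 35.
n_arrangements = math.comb(60, 25)

print(f"{n_arrangements:,} arrangements (about {n_arrangements:.2e})")
```

A number of that size is far too large to enumerate, which is why sampling a few thousand random arrangements is the practical route.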

The data you posted (although it's not the sample sizes you suggest, so please check it and also respond to my questions below) are:

i) heavily tied, and

ii) pretty strongly left skew.

I'd recommend doing a randomization test based off either a t-test (difference of means) or a Wilcoxon-Mann-Whitney type (difference of mean ranks, or equivalently, sum of first sample ranks) test. It probably won't make a big difference which you choose, but given you'd already be doing randomization, I think perhaps it makes more sense to choose a statistic based off the t-test (something monotonic in it, like the sum or mean of the first sample); so it's still clearly a test for means.

I just did this randomization test on the data you posted; the p-value sits neatly between that of the standard Welch t-test and that of the Mann-Whitney.
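For readers who want to try something similar, here is a minimal sketch of a randomization test on the difference in means for the posted data. It is not the code behind the result mentioned above; the function name, the seed, and the choice of 10,000 random arrangements are assumptions made for illustration.

```python
import random

# Data for the single item posted in the question.
group_a = [10, 8, 10, 7, 10, 7, 6, 10, 7, 10, 5, 4, 5, 10, 10, 10,
           2, 7, 4, 9, 8, 10, 10]
group_b = [10, 10, 8, 9, 10, 10, 9, 10, 10, 10, 10, 10, 8, 8, 7, 8,
           10, 8, 10, 10, 10, 8, 9, 10, 8, 10, 10, 10, 10, 5, 10]


def mean(values):
    return sum(values) / len(values)


def randomization_test(x, y, n_resamples=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)
    observed = abs(mean(x) - mean(y))
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled values to groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_x]) - mean(pooled[n_x:]))
        # Small tolerance so floating-point noise does not break exact ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Count the observed arrangement too, so the p-value is never exactly 0.
    return (hits + 1) / (n_resamples + 1)


p_value = randomization_test(group_a, group_b)
print(f"randomization p-value (difference in means): {p_value:.4f}")
```

Because the group sizes are fixed, the absolute difference in means orders the arrangements the same way as the distance of the first sample's mean (or sum) from its expected value, so any of these statistics gives the same two-sided test.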

For anyone else following along, this is what the posted data look like (these are not the samples of 25 and 35 referred to in the question, but smaller ones):

[Figure: ECDF of perception]

[Figure: barplot]


> As per the second, my survey consisted of about 25 Likert scale questions, which were categorized into five different perception fields. So the data I gave was the data from one of those questions.

Are you planning on looking at all 25 separately like this, as a large number of (univariate) two sample tests? Summed within each field and analyzed as a smaller number of univariate two sample tests (five of them)? Summed within each field and analyzed as one multivariate test? Summed to give a single overall perception score and analyzed with one two sample test?

What you're doing may make a difference - the more things you sum, the less the discreteness and skewness will impact things.

> Also, my training is very limited in stats, and I'm basically learning on my own. I'm not always sure about some of the terms you use,

You can always ask - I will try to improve my answer where I can.

> That being said, if you have any good resources for someone in my position, I'd greatly appreciate it.

It kind of depends on what you need.

Did you want a reference on nonparametrics?

I will say there's no good substitute for the insight you can gain by learning how to do simulation studies of your own to assess the effect of various circumstances on the tests you use, nor for being able to do permutation/randomization and bootstrap versions of tests if you feel the standard tests are badly affected.
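To give a flavour of what such a simulation study can look like, here is a small sketch that estimates how often a Mann-Whitney test (normal approximation with tie correction, no continuity correction) rejects at the nominal 5% level when both groups come from the same skewed 10-point distribution. The response probabilities in `WEIGHTS` are invented for illustration and are not estimated from any data; the function names are likewise hypothetical.

```python
import math
import random
from collections import Counter

SCALE = list(range(1, 11))
# Hypothetical left-skewed response weights (an assumption, not fitted to data).
WEIGHTS = [1, 1, 1, 2, 3, 4, 6, 10, 12, 30]


def mw_z(x, y):
    """Mann-Whitney z-score using midranks and the tie-corrected variance."""
    n_x, n_y = len(x), len(y)
    n = n_x + n_y
    counts = Counter(x) + Counter(y)
    midrank, below = {}, 0
    for value in sorted(counts):
        t = counts[value]
        midrank[value] = below + (t + 1) / 2  # average of ranks below+1 .. below+t
        below += t
    u = sum(midrank[v] for v in x) - n_x * (n_x + 1) / 2
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    var = n_x * n_y / 12 * ((n + 1) - tie_term)
    if var == 0:  # every observation identical: no evidence either way
        return 0.0
    return (u - n_x * n_y / 2) / math.sqrt(var)


def rejection_rate(n_x, n_y, n_sims=2000, z_crit=1.96, seed=1):
    """Share of simulated null data sets in which |z| exceeds z_crit."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x = rng.choices(SCALE, weights=WEIGHTS, k=n_x)
        y = rng.choices(SCALE, weights=WEIGHTS, k=n_y)
        if abs(mw_z(x, y)) > z_crit:
            rejections += 1
    return rejections / n_sims


rate = rejection_rate(23, 31)
print(f"estimated type I error rate at nominal 5%: {rate:.3f}")
```

Giving the second group different weights turns the same loop into a power study, and swapping `mw_z` for another statistic lets you compare tests under identical conditions.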


@Fly - this is a good question.

The effect of ties is to reduce the variance of the Mann-Whitney statistic and to make it 'more discrete'. Its exact distribution becomes harder to compute but when the sample sizes are large enough for the normal to approximate it well (which takes longer to kick in with ties), this is not such a big issue. In any case, the permutation distribution of both the t-test and the t-test on the ranks (equivalent to the MW) can be obtained for a given set of values - or in larger samples, approximated to any desired degree by sampling.
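The variance reduction can be seen numerically. The sketch below computes the Mann-Whitney $U$ for the posted data using midranks, then the variance of $U$ under the null both without and with the usual tie correction, $\frac{n_1 n_2}{12}\left[(N+1) - \sum_j \frac{t_j^3 - t_j}{N(N-1)}\right]$, where $t_j$ is the number of observations sharing the $j$-th distinct value. The function name is hypothetical, and the p-value uses a plain normal approximation without continuity correction.

```python
import math
from collections import Counter
from statistics import NormalDist

# Data for the single item posted in the question.
group_a = [10, 8, 10, 7, 10, 7, 6, 10, 7, 10, 5, 4, 5, 10, 10, 10,
           2, 7, 4, 9, 8, 10, 10]
group_b = [10, 10, 8, 9, 10, 10, 9, 10, 10, 10, 10, 10, 8, 8, 7, 8,
           10, 8, 10, 10, 10, 8, 9, 10, 8, 10, 10, 10, 10, 5, 10]


def mann_whitney_with_ties(x, y):
    """Return U for x, its null variance without/with tie correction, z and p."""
    n_x, n_y = len(x), len(y)
    n = n_x + n_y
    counts = Counter(list(x) + list(y))
    midrank, below = {}, 0
    for value in sorted(counts):
        t = counts[value]
        midrank[value] = below + (t + 1) / 2  # average of ranks below+1 .. below+t
        below += t
    u = sum(midrank[v] for v in x) - n_x * (n_x + 1) / 2
    var_no_ties = n_x * n_y * (n + 1) / 12
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    var_ties = n_x * n_y / 12 * ((n + 1) - tie_term)
    z = (u - n_x * n_y / 2) / math.sqrt(var_ties)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, var_no_ties, var_ties, z, p


u, var_no_ties, var_ties, z, p = mann_whitney_with_ties(group_a, group_b)
print(f"U = {u}, variance without ties = {var_no_ties:.1f}, with ties = {var_ties:.1f}")
print(f"z = {z:.2f}, two-sided normal-approximation p = {p:.4f}")
```

With data this heavily tied the corrected variance is noticeably smaller than the untied formula gives, so ignoring the correction would make the test conservative.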

When one asks about power studies in the presence of ties, it depends on what we're comparing the power to.

e.g. Ties reduce the power relative to a purely continuous (and thereby untied) case; that one's pretty well known.

What we're looking at here is highly discrete (inherently tied) data that's very skew; it doesn't make sense to compare it to a continuous case (though the comparable continuous case would be one in which both samples are highly skew).

When the variable only takes a few values, there's not all that much difference between looking at ranks and looking at the values themselves; they're both quite discrete (consider the ECDF of both).

General reasoning would suggest that in that circumstance a test based on the ranks and one based on the original data both contain similar information. This is largely confirmed by this study:

de Winter, J. C. F. and D. Dodou (2010),
Five-Point Likert Items: t test versus Mann-Whitney-Wilcoxon
Practical Assessment, Research & Evaluation, Vol 15, No 11

"The results showed that the two tests had equivalent power for most of the pairs. MWW had a power advantage when one of the samples was drawn from a skewed or peaked distribution."

They consider 14 distributions on a 5-point Likert scale, which are characterized by verbal descriptions. The circumstance under consideration here seems to be best approximated by their "Very strongly agree" response vs their "Strongly agree" response, though their study used smaller sample sizes. This assessment of which situation we're in is also based on the OP's small sample, so we should keep in mind that the actual distributions may be somewhat different from those ones.

Their data is somewhat more discrete than this case (a 5-point rather than a 10-point scale) and their sample sizes are smaller (10 and 10 rather than 23 and 31).

Their results suggest that the t-test was slightly more powerful than the Mann-Whitney in that situation. Some of the other nearby cases show less difference. I would expect the difference to diminish when the discreteness is reduced by increasing the number of points in the scale.

On the other hand, the study in this PhD thesis:

Warachan, B. (2011),
Appropriate Statistical Analysis For Two Independent Groups Of Likert-Type Data
(Faculty of the College of Arts and Sciences, American University, Washington D.C.)

http://aladinrc.wrlc.org/bitstream/handle/1961/11137/Warachan_american_0008E_10166display.pdf?sequence=1

... suggests more of a power advantage to the Mann-Whitney on small effect sizes with skew alternatives at the 30,30 sample size and a 7-point scale. Notably it also looks at the Kolmogorov-Smirnov test, which may be even better power-wise, in spite of the fact that its actual type I error rate is much lower than nominal - that is, it's effectively about a 1% test when using the 5% critical values in the 'mildly skew' 7-point situation, while the $t$-test and the Mann-Whitney were both very close to 5% tests.

Which is to say, I don't think there's much to choose between the t-test and the Mann-Whitney on the basis of the information in these studies. The impact on type I error rates in the most skewed case is similar, too. Other considerations (such as how likely referees are to accept one test or the other) may be more important.

[The package 'samplesize' in R seems to be useful for investigating some of these t-vs-MW differences on Likert-scales as well.]
