Solved – Mann-Whitney U test with very large sample size

large-data, spss, wilcoxon-mann-whitney-test

I'm doing a Mann-Whitney U test to compare two very large samples (sample size 1 = 13250; sample size 2 = 38871) originating from a raster image. I know t-tests are not recommended for comparing rasters: because rasters have so many values, the test will almost surely detect a significant difference, no matter how small that difference might be (see this post on gis.stackexchange.com and point 3 of the first answer to this post).

My question is whether the same problem applies to Mann-Whitney or not. I ran the test on SPSS and got the following results:

Test Statistics (test variable: selection frequency):

  • Mann-Whitney U: 4094520,000
  • Wilcoxon W: 759591276,000
  • Z: -210,227
  • Asymp. Sig. (2-tailed): ,000

While I did expect to find a difference between the groups, I don't know what to make of such large values.

There is a paper that did the same thing I did (see the last sentence of the Materials and Methods section), but they obtained much smaller values: http://icesjms.oxfordjournals.org/content/69/1/75/T5.expansion.html (they do not make clear whether these are Mann-Whitney U values or z-scores). Granted, they used a smaller sample (combined sample size = 5629), but the difference in the magnitude of the values still seems strange to me.

So, are my results simply the result of a very large sample, but still valid? Or should I use another test?

Best Answer

This is not a problem of the t-test specifically, but of any test whose power grows with the sample size: given enough data, even a negligible difference will come out as statistically significant. This is sometimes called "overpowering". So no, switching to Mann-Whitney will not help.
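To see this concretely, here is a minimal sketch in Python with SciPy (the sample sizes are taken from your question, but the data are synthetic, so the exact numbers will differ from yours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for the two raster samples, with the sample sizes
# from the question and a deliberately tiny shift of 0.05 standard deviations.
x1 = rng.normal(loc=0.00, scale=1.0, size=13250)
x2 = rng.normal(loc=0.05, scale=1.0, size=38871)

# With samples this large, even the negligible 0.05-SD shift comes out
# as "highly significant".
u, p = stats.mannwhitneyu(x1, x2, alternative="two-sided")
print(f"U = {u:.0f}, p = {p:.2e}")  # p is tiny despite the trivial effect
```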

Therefore, apart from asking whether the results are statistically significant, you need to ask whether the observed effect size is significant in the everyday sense of the word (i.e., meaningful). This requires not only statistical knowledge but also expertise in the field you are investigating.

In general, there are two ways to look at the effect size. One is to scale the difference between the group means by the standard deviation of the data. Since the standard deviation is in the same units as the means and describes the dispersion of your data, you can express the difference between the groups in units of standard deviation. Importantly, the estimated standard deviation of the data does not systematically shrink as the sample size grows (unlike the standard error of the mean).

This is, for example, the reasoning behind Cohen's $d$:

$$d = \frac{ \bar{x}_1 - \bar{x}_2 }{ s}$$

...where $s$ is the square root of the pooled variance:

$$s = \sqrt{\frac{ s_1^2\cdot(n_1-1) + s_2^2\cdot(n_2 - 1) }{ N - 2 } }$$

(where $N=n_1+n_2$, and $s_1$ and $s_2$ are the standard deviations of groups 1 and 2, respectively; that is, $s_1 = \sqrt{ \frac{\sum(x_i-\bar{x}_1)^2 }{n_1 -1 }} $).
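As a sketch, this calculation in plain NumPy (the function name is hypothetical, not from the thread):

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d: difference in means scaled by the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    # ddof=1 gives the (n - 1) denominator used in the formulas above.
    s1_sq, s2_sq = np.var(x1, ddof=1), np.var(x2, ddof=1)
    # Pooled standard deviation s.
    s = np.sqrt((s1_sq * (n1 - 1) + s2_sq * (n2 - 1)) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / s
```

Run on the synthetic samples from the first sketch, this gives a $d$ of roughly $-0.05$, i.e., about a twentieth of a standard deviation, despite the minuscule p-value.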

Another way of looking at the effect size -- and frankly, the one I personally prefer -- is to ask what part (percentage) of the variability in the data can be explained by the estimated effect. You can estimate the variance between and within the groups and see how they relate; this is actually what ANOVA does, and the t-test is in principle a special case of ANOVA. This is the reasoning behind the coefficient of determination, $r^2$, and the related $\eta^2$ and $\omega^2$ statistics. In a t-test, $\eta^2$ can easily be calculated from the $t$ statistic itself:

$$\eta^2 = \frac{ t^2}{t^2 + n_1 + n_2 - 2 }$$

This value can be directly interpreted as the fraction of variance in the data that is explained by the difference between the groups. There are various rules of thumb for what counts as a "large" or a "small" effect, but it all depends on your particular question: 1% of variance explained can be laughable, or it can be just enough.
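Here is the same idea as a sketch (Python with SciPy; a hypothetical helper, assuming the equal-variance two-sample t-test):

```python
from scipy import stats

def eta_squared(x1, x2):
    """Fraction of variance explained by the group difference,
    derived from the two-sample t statistic."""
    t, _ = stats.ttest_ind(x1, x2)  # equal-variance (Student) t-test
    df = len(x1) + len(x2) - 2
    return t**2 / (t**2 + df)
```

Applied to the synthetic samples from the first sketch, this comes out around 0.0005: statistically "significant", yet explaining only about a twentieth of a percent of the variance.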
