Solved – Mann-Whitney U test with very large sample size

large-data, spss, wilcoxon-mann-whitney-test

I'm doing a Mann-Whitney U test to compare two very large samples (sample size 1 = 13250; sample size 2 = 38871) originating from a raster image. I know t-tests are not recommended for comparing rasters: because rasters have so many values, the test will almost surely detect a significant difference, no matter how small that difference might be (see this post on gis.stackexchange.com and point 3 of the first answer to this post).

My question is whether the same problem applies to Mann-Whitney or not. I ran the test on SPSS and got the following results:

Test Statistics (test variable: selection frequency):

  • Mann-Whitney U: 4094520,000
  • Wilcoxon W: 759591276,000
  • Z: -210,227
  • Asymp. Sig. (2-tailed): ,000

While I did expect to find a difference between the groups, I don't know what to make of such large values.

There is a paper that did the same thing I did (see the last sentence of the Materials and Methods section), but they obtained much smaller values: http://icesjms.oxfordjournals.org/content/69/1/75/T5.expansion.html (they do not make clear whether these are Mann-Whitney U values or z-scores). Granted, they used a smaller sample (combined sample size = 5629), but the difference in the magnitude of the values still seems strange to me.

So, are my results simply the result of a very large sample, but still valid? Or should I use another test?

Best Answer

This is not a problem of the t-test specifically, but of any test whose power grows with the sample size: given enough data, even a negligible difference will come out as statistically significant. This is sometimes called "overpowering". So no, switching to Mann-Whitney will not help.
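To see this concretely, here is a minimal sketch in Python with SciPy (the sample sizes are taken from your question, but the data are synthetic, so the exact numbers will differ from yours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for the two raster samples, with the sample sizes
# from the question and a deliberately tiny shift of 0.05 standard deviations.
x1 = rng.normal(loc=0.00, scale=1.0, size=13250)
x2 = rng.normal(loc=0.05, scale=1.0, size=38871)

# With samples this large, even the negligible 0.05-SD shift comes out
# as "highly significant".
u, p = stats.mannwhitneyu(x1, x2, alternative="two-sided")
print(f"U = {u:.0f}, p = {p:.2e}")  # p is tiny despite the trivial effect
```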

Therefore, apart from asking whether the results are statistically significant, you need to ask whether the observed effect size is significant in the everyday sense of the word (i.e., meaningful). This requires not only statistical knowledge but also expertise in the field you are investigating.

In general, there are two ways to look at the effect size. One is to scale the difference between the group means by the standard deviation of the data. Since the standard deviation is in the same units as the means and describes the dispersion of your data, you can express the difference between the groups in units of standard deviation. Importantly, the estimated standard deviation of the data does not systematically shrink as the sample size grows (unlike the standard error of the mean).

This is, for example, the reasoning behind Cohen's $d$:

$$d = \frac{ \bar{x}_1 - \bar{x}_2 }{ s}$$

...where $s$ is the square root of the pooled variance:

$$s = \sqrt{\frac{ s_1^2\cdot(n_1-1) + s_2^2\cdot(n_2 - 1) }{ N - 2 } }$$

(where $N=n_1+n_2$, and $s_1$ and $s_2$ are the standard deviations of groups 1 and 2, respectively; that is, $s_1 = \sqrt{ \frac{\sum(x_i-\bar{x}_1)^2 }{n_1 -1 }} $).
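As a sketch, this calculation in plain NumPy (the function name is hypothetical, not from the thread):

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d: difference in means scaled by the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    # ddof=1 gives the (n - 1) denominator used in the formulas above.
    s1_sq, s2_sq = np.var(x1, ddof=1), np.var(x2, ddof=1)
    # Pooled standard deviation s.
    s = np.sqrt((s1_sq * (n1 - 1) + s2_sq * (n2 - 1)) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / s
```

Run on the synthetic samples from the first sketch, this gives a $d$ of roughly $-0.05$, i.e., about a twentieth of a standard deviation, despite the minuscule p-value.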

Another way of looking at the effect size -- and frankly, the one I personally prefer -- is to ask what part (percentage) of the variability in the data can be explained by the estimated effect. You can estimate the variance between and within the groups and see how they relate; this is actually what ANOVA does, and the t-test is in principle a special case of ANOVA. This is the reasoning behind the coefficient of determination, $r^2$, and the related $\eta^2$ and $\omega^2$ statistics. In a t-test, $\eta^2$ can easily be calculated from the $t$ statistic itself:

$$\eta^2 = \frac{ t^2}{t^2 + n_1 + n_2 - 2 }$$

This value can be directly interpreted as the fraction of variance in the data that is explained by the difference between the groups. There are various rules of thumb for what counts as a "large" or a "small" effect, but it all depends on your particular question: 1% of variance explained can be laughable, or it can be just enough.
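Here is the same idea as a sketch (Python with SciPy; a hypothetical helper, assuming the equal-variance two-sample t-test):

```python
from scipy import stats

def eta_squared(x1, x2):
    """Fraction of variance explained by the group difference,
    derived from the two-sample t statistic."""
    t, _ = stats.ttest_ind(x1, x2)  # equal-variance (Student) t-test
    df = len(x1) + len(x2) - 2
    return t**2 / (t**2 + df)
```

Applied to the synthetic samples from the first sketch, this comes out around 0.0005: statistically "significant", yet explaining only about a twentieth of a percent of the variance.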
