Solved – Kolmogorov Smirnov Z vs Mann Whitney U small sample size n= 15

Tags: kolmogorov-smirnov test, mean, median, small-sample, wilcoxon-mann-whitney-test

I have a small sample of 15. I want to see if there is a difference in nutrient intakes between two independent groups (group 1: n = 11; group 2: n = 4). The data are not normally distributed. Which test is more appropriate, the Mann-Whitney U or the Kolmogorov-Smirnov Z test? Andy Field's Discovering Statistics Using SPSS states that K-S Z should be used for small sample sizes:

Kolmogorov-Smirnov Z: In Chapter 5 we met a Kolmogorov–Smirnov test
that tested whether a sample was from a normally distributed
population. This is a different test! In fact, it tests whether two
groups have been drawn from the same population (regardless of what
that population may be). In effect, this means it does much the same
as the Mann–Whitney test! However, this test tends to have better
power than the Mann–Whitney test when sample sizes are less than about
25 per group, and so is worth selecting if that’s the case.

Also, when reporting the intakes along with the p values, should I use mean and standard deviation or median and IQR, as the data are non-parametric?

Any advice would be greatly appreciated.

Best Answer

If the original statement doesn't limit the conditions under which it applies pretty substantially, Field is just wrong on this.

Responding to the quoted section:

"In effect, this means it does much the same as the Mann–Whitney test!"

No, it really doesn't. They really test for different kinds of things. As one example, if two close-to-symmetric distributions differ in spread but don't differ in location, the Kolmogorov-Smirnov can identify that kind of difference (in large enough samples relative to the effect) but the Wilcoxon-Mann-Whitney can't.

This is because they're designed for different purposes.
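A minimal sketch of that point, using made-up data rather than anything from the question: the two samples below share the same centre but differ sharply in spread. The helper names `mann_whitney_u` and `ks_distance` are hypothetical, and the example is deliberately extreme so the arithmetic is easy to follow.

```python
def mann_whitney_u(x, y):
    """Number of (x_i, y_j) pairs with x_i > y_j, counting ties as one half."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)


def ks_distance(x, y):
    """Largest vertical gap between the two empirical CDFs."""
    def ecdf(sample, t):
        return sum(v <= t for v in sample) / len(sample)

    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in sorted(set(x) | set(y)))


# Hypothetical data: both samples are centred on 0, but one is far more spread out.
narrow = [i / 10 for i in range(-10, 11) if i != 0]                 # 20 values in [-1, 1]
wide = [-5 + i / 10 for i in range(10)] + [4.1 + i / 10 for i in range(10)]  # 20 values

u = mann_whitney_u(narrow, wide)
null_mean_u = len(narrow) * len(wide) / 2
d = ks_distance(narrow, wide)

print(f"U = {u}, null mean of U = {null_mean_u}")  # U sits exactly at its null mean
print(f"D = {d}")                                  # while the ECDFs are far apart
```

Here U lands exactly on its null expectation, so the Wilcoxon-Mann-Whitney sees nothing, while the Kolmogorov-Smirnov distance is large.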

"However, this test tends to have better power than the Mann–Whitney test when sample sizes are less than about 25 per group, and so is worth selecting if that’s the case."

As a general claim, this is nonsense. Against the things the Mann-Whitney doesn't test it has better power, but against the things the Mann-Whitney is meant for, it doesn't. This doesn't change when $n<25$.

[There may be some situation where the claim is true; if Field doesn't explain what context his claim applies in, I'm not likely to be able to guess it.]

Here's a power curve for n=20 per group. The significance level is a bit over 3% for each test (in fact the achievable significance level for the KS is slightly higher, and I have not attempted to use a randomized test to adjust for that difference, so the KS has been given a small advantage in this comparison):

As we see, in this case (the first one I tried) the Wilcoxon-Mann-Whitney is clearly more powerful.

At n=5, the Kolmogorov-Smirnov remains less powerful for this situation. [So what the heck is he talking about? Is he comparing power for some situation not mentioned in the quote? I don't know, but going just on what's quoted here we should not take that claim at face value. It was wrong in the first thing I checked and, based on a broader familiarity with the two tests, I would readily bet it's wrong for a bunch of other situations.]

At sample sizes of 4 and 11 for shift alternatives (and normal populations), again, the Wilcoxon-Mann-Whitney does better.

With the variable you're looking at, a suitable alternative is probably something more like a scale shift; but if some power transformation of your data (a square root or a cube root, say, or better still a log) isn't too non-normal looking, the results I mention should be relevant. If you have discrete or zero-inflated data that may make some difference, but my bet would be that the Kolmogorov-Smirnov doesn't overtake the Wilcoxon-Mann-Whitney then either. [I won't pursue this at present because it's not clear if it's relevant for your situation.]

In addition, the attainable significance levels with the Kolmogorov-Smirnov are very gappy at small sample sizes. You often can't get tests close to the usual significance levels you are likely to want. (The WMW does much better than the KS in relation to available test sizes. There is a neat way to dramatically reduce this gappiness of levels without losing either the nonparametric or the rank-based nature of tests like these, and without resorting to randomized testing, but it seems to be very rarely used for some reason.)
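The attainable levels can be listed exactly for the sample sizes in the question (4 and 11) by enumerating every way the four group-1 observations can fall among the 15 pooled ranks. The following is a sketch under the assumption of continuous data with no ties; the function names are hypothetical, and the two-sided WMW level is taken as the probability of U being at least as far from its null mean as the cutoff.

```python
from fractions import Fraction
from itertools import combinations

N1, N2 = 4, 11  # group sizes from the question


def rank_statistics(group1_positions, n1, n2):
    """Integer-scaled statistics for one arrangement of the pooled ordering.

    Returns (|2U - n1*n2|, D * n1 * n2), so no floating point is involved.
    group1_positions: set of 0-based positions held by group 1.
    """
    seen1 = seen2 = u = d_scaled = 0
    for pos in range(n1 + n2):
        if pos in group1_positions:
            seen1 += 1
            u += seen2  # group-2 values lying below this group-1 value
        else:
            seen2 += 1
        d_scaled = max(d_scaled, abs(seen1 * n2 - seen2 * n1))
    return abs(2 * u - n1 * n2), d_scaled


def attainable_levels(values):
    """Exact levels P(stat >= c) under the null, for every observable cutoff c."""
    total = len(values)
    return sorted({Fraction(sum(v >= c for v in values), total) for c in set(values)})


arrangements = [set(c) for c in combinations(range(N1 + N2), N1)]
stats = [rank_statistics(a, N1, N2) for a in arrangements]
wmw_levels = attainable_levels([s[0] for s in stats])
ks_levels = attainable_levels([s[1] for s in stats])

for name, levels in (("WMW", wmw_levels), ("KS", ks_levels)):
    near = [round(float(p), 4) for p in levels if p <= Fraction(1, 10)]
    print(name, "attainable two-sided levels up to 0.10:", near)
```

Comparing the two printed lists shows how many cutoffs each test offers near a conventional level such as 0.05, and how far the nearest one falls below it.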

Note that I carefully chose examples that made the levels of the two tests close to comparable. If you're just choosing $\alpha=0.05$ every time without regard to the available levels and comparing a p-value to that, then the gappiness of the Kolmogorov-Smirnov's attainable levels is going to make its power much worse in general (though it will very occasionally help it a little, as here; that advantage will not generally be large, and probably not enough to help it beat the WMW at the task it's suited for).

If you're in a situation where the Wilcoxon-Mann-Whitney tests what you want to test, I would definitely not recommend using the Kolmogorov-Smirnov instead. I'd use each test for what it's designed to test, which is where each tends to do fairly well.

The best way to figure out what's best is to try some simulations in situations that would be realistic for the kind of data you will have. Then you can see when it does what.
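One way such a simulation might look, as a rough sketch rather than a recipe: the lognormal populations, the size of the shift, and the function names are all assumptions made for illustration, and the data are assumed continuous (no ties). Because both tests depend only on ranks, a scale shift in lognormal data is equivalent to a location shift on the log scale. P-values are exact, obtained by enumerating every rank arrangement for the given group sizes.

```python
import random
from itertools import combinations


def rank_statistics(labels, n1, n2):
    """Integer-scaled WMW and KS statistics from group labels in pooled sorted order."""
    seen1 = seen2 = u = d_scaled = 0
    for is_group1 in labels:
        if is_group1:
            seen1 += 1
            u += seen2  # group-2 values below this group-1 value
        else:
            seen2 += 1
        d_scaled = max(d_scaled, abs(seen1 * n2 - seen2 * n1))
    return abs(2 * u - n1 * n2), d_scaled


def exact_p_tables(n1, n2):
    """Map each attainable statistic to its exact two-sided p-value (no ties)."""
    n = n1 + n2
    null = []
    for combo in combinations(range(n), n1):
        members = set(combo)
        null.append(rank_statistics([i in members for i in range(n)], n1, n2))
    tables = []
    for k in (0, 1):
        values = [s[k] for s in null]
        tables.append({c: sum(v >= c for v in values) / len(values) for c in set(values)})
    return tables  # [wmw_table, ks_table]


def estimate_power(n1, n2, log_shift, alpha=0.05, n_sim=2000, seed=1):
    """Simulated rejection rates (WMW, KS) for a lognormal scale shift."""
    rng = random.Random(seed)
    wmw_table, ks_table = exact_p_tables(n1, n2)
    rejections = [0, 0]
    for _ in range(n_sim):
        pooled = [(rng.lognormvariate(log_shift, 1.0), True) for _ in range(n1)]
        pooled += [(rng.lognormvariate(0.0, 1.0), False) for _ in range(n2)]
        pooled.sort()
        wmw_stat, ks_stat = rank_statistics([g for _, g in pooled], n1, n2)
        rejections[0] += wmw_table[wmw_stat] <= alpha
        rejections[1] += ks_table[ks_stat] <= alpha
    return rejections[0] / n_sim, rejections[1] / n_sim


for shift in (0.0, 1.0, 2.0):
    wmw_power, ks_power = estimate_power(4, 11, shift)
    print(f"log-scale shift {shift}: WMW {wmw_power:.3f}, KS {ks_power:.3f}")
```

The row with shift 0.0 estimates each test's actual size at a nominal 0.05, which is worth checking before comparing the power figures. Swapping in populations that resemble your own data (zero-inflated, discrete, different spreads) would need tie handling added.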

"Also, when reporting the intakes along with the p values, should I use mean and standard deviation or median and IQR, as the data are non-parametric?"

Data are just data. They're neither parametric nor nonparametric -- that's a property of models and inferential procedures that we use which rely on them (estimation, testing, intervals). Parametric means "defined up to a fixed, finite number of parameters", which is not an attribute of data but of models. If you can't just give both sets of values (which would be my preference) and must instead choose one or the other, which is more relevant scientifically or in relation to your question of interest?

[Note that the Wilcoxon-Mann-Whitney doesn't compare either means or medians (unless you add some assumptions I bet don't come close to applying in this case). Nor does the Kolmogorov-Smirnov.]

Also when reporting the intakes along with the p values, should I use mean and standard deviation or median and IQR

My general advice is to report what makes sense to report for that variable (without worrying very much about what its distribution might be); if you want to know something about the population mean, the sample mean generally makes sense to report, similarly for the population median. Personally I rarely look at only one summary statistic and when reading a paper, I am interested in more than one.
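Reporting both sets of summaries per group takes only a few lines. This is a sketch with invented intake values (they are not the asker's data), and the `describe` helper is hypothetical; quartile conventions differ between packages, so the IQR bounds here may not match SPSS output exactly.

```python
import statistics


def describe(values):
    """Mean/SD and median/quartiles for one group."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


# Invented nutrient intakes, sized like the groups in the question (11 and 4).
groups = {
    "group 1": [52.0, 61.5, 48.2, 75.0, 66.3, 58.9, 120.4, 55.1, 63.7, 49.8, 71.2],
    "group 2": [44.0, 58.5, 39.7, 90.1],
}

for name, values in groups.items():
    s = describe(values)
    print(
        f"{name}: n={s['n']}, mean={s['mean']:.1f} (SD {s['sd']:.1f}), "
        f"median={s['median']:.1f} (IQR {s['q1']:.1f} to {s['q3']:.1f})"
    )
```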

Neither sample means nor sample medians will correspond to what either of the tests here is comparing.

Related Solutions

Solved – How to calculate the effect size for the Kolmogorov-Smirnov Z statistic

Yes. $D = Z/\sqrt{n}$ for the one-sample test. $D = Z/\sqrt{\frac{n_1 n_2}{n_1 + n_2}}$ for the two-sample test. $D$ should also be the "Most Extreme Differences - Absolute" entry in the output graphic (double-click the table shown in the SPSS output viewer). $Z$ might be labeled "Test Statistic," "Kolmogorov-Smirnov Z," or something else depending on which test and version of SPSS you're using.
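The two conversions can be written as a small helper. The function name `ks_d_from_z` is made up for illustration; it applies only the formulas stated above and assumes Z is the unadjusted Kolmogorov-Smirnov Z that SPSS reports.

```python
import math


def ks_d_from_z(z, n1, n2=None):
    """Convert a Kolmogorov-Smirnov Z to D.

    One-sample test (n2 omitted):  D = Z / sqrt(n1)
    Two-sample test:               D = Z / sqrt(n1 * n2 / (n1 + n2))
    """
    if n2 is None:
        scale = math.sqrt(n1)
    else:
        scale = math.sqrt(n1 * n2 / (n1 + n2))
    return z / scale


# Illustrative values only: a Z of 1.0 from one sample of 25,
# and a Z of 1.2 from two samples of sizes 11 and 4.
print(ks_d_from_z(1.0, 25))
print(ks_d_from_z(1.2, 11, 4))
```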
It depends. Mann-Whitney tests for a difference in the central tendencies by comparing average ranks; K-S tests for a difference in distributions by comparing the maximum difference in empirical cumulative distribution functions. If you expect strong shape differences, such as only low and high values in one group but middle values for the other group (this would be atypical for most data), K-S is a better choice. If you expect just a location shift, Mann-Whitney is more powerful.

Mann-Whitney U Test – Conducting Mann-Whitney U Test and K-S Test with Unequal Sample Sizes

With such large sample sizes both tests will have high power to detect minor differences. The two distributions could be almost identical, with a small difference in shape or location that is not of practical importance, and the tests would still reject (because the distributions are different).

If all you really care about is a statistically significant difference then you can be happy with the results of the KS test (and others, even a t-test will be meaningful with non-normal data of those sample sizes due to the Central Limit Theorem).

If you care about practical or meaningful differences then things become subjective, but you can compare using various plots to help you decide if you think there are differences that are enough to care about.

Another possibility is doing a visual test as documented in

 Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.-K., Swayne,
 D. F. and Wickham, H. (2009). Statistical inference for exploratory
 data analysis and model diagnostics. Phil. Trans. R. Soc. A, 367,
 4361-4383. doi: 10.1098/rsta.2009.0120

The vis.test function in the TeachingDemos package for R helps implement the test, but it can be done by hand as well.

Basically you create a bunch of graphs and then see if you can tell which is which. For your question one possibility would be to create a histogram of the 122,000 observations from the one month, then take several samples of 122,000 from the 300,000 observations of the other month and create histograms of each of those samples. Then present someone (or several someones) with all the histograms in random order and see if they can pick out the one that represents the second month. If they consistently pick out the correct graph then that says there is something visually different and you can further explore how they differ. If they don't pick out the correct graph then that suggests that while there may be a statistically significant difference, it is not important enough to distinguish them visually.
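Done by hand, the bookkeeping for such a lineup might look like the sketch below. It is not the `vis.test` function; `build_lineup` is a hypothetical helper, the data are small synthetic stand-ins for the two months, and drawing the histograms is left to whatever plotting tool you prefer.

```python
import random


def build_lineup(observed, reference_pool, n_panels=20, seed=0):
    """Hide the observed sample among decoys drawn from the reference pool.

    Returns (panels, true_index); each panel holds len(observed) values,
    and true_index is the position of the real sample (kept from the viewer).
    """
    rng = random.Random(seed)
    decoys = [rng.sample(reference_pool, len(observed)) for _ in range(n_panels - 1)]
    true_index = rng.randrange(n_panels)
    panels = decoys[:true_index] + [list(observed)] + decoys[true_index:]
    return panels, true_index


# Hypothetical, scaled-down stand-ins for the two months of data.
data_rng = random.Random(42)
month_small = [data_rng.gauss(10.2, 2.0) for _ in range(200)]
month_large = [data_rng.gauss(10.0, 2.0) for _ in range(500)]

panels, true_index = build_lineup(month_small, month_large)
chance = 1 / len(panels)
print(f"{len(panels)} panels; a viewer guessing at random is right with probability {chance:.2f}")
```

With 20 panels, a single viewer who picks the real sample has done something that would happen only 5% of the time by pure guessing, which is what gives the visual test its inferential footing.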
