Solved – Testing if two non-normal distributions are significantly different (K-S or Wilcoxon or both?)

Tags: distributions, kolmogorov-smirnov test, nonparametric, wilcoxon-mann-whitney-test

The Big Question

I'm trying to decide between the K-S and Wilcoxon tests (or something I haven't thought of yet) for two data sets that I want to show are, or are not, significantly different from one another.

"Significantly different in what way?" you ask. Well, I want to compare whether they trend differently, so I guess that means I want to see whether they are from the same population. I think comparing the means might not be a valuable way of testing this, and because my data is non-normal I ruled out Student's t (but that might just be my ignorance).

Because my sample size is smallish (see the How I've Treated My Data section below), and because the distribution is non-normal, I was thinking that either the Kolmogorov-Smirnov or the Wilcoxon signed-rank test would be best to compare the two distributions.

Are there other tests that would suit my needs better?

How I've Treated My Data

I'm not terribly sure how relevant this is, but it JUST MIGHT be of use to you folks, so here it goes:

So, I have two groups of paired histograms; that is to say, each column represents a range of values (e.g. <1, 1-5, 5-10, etc.) and the value of each cell is the number of elements in my sample within that value range.

There are 9 pairs in group 1 and 15 pairs in group 2.

EDIT: Each pair represents a gene, and the first row of each pair is the wildtype data while the second row is the mutated data. Each group represents a family of genes.

[Group 1]
Pair 1-1,  11   0   12  0   58  1   72  8   18  31
Pair 1-2,  23   0   6   0   54  8   70  11  21  18
...
Pair n-1, ...
Pair n-2, ...

[Group 2]
Pair 1-1,  11   0   12  0   58  1   72  8   18  31
Pair 1-2,  23   0   6   0   54  8   70  11  21  18
...
Pair n-1, ...
Pair n-2, ...

I want to see the relative change between the two histograms of each pair, so I've normalised them to one another using this formula (because I often have one of the values equalling zero while the other doesn't).

EDIT: I want to see the relative change between these histograms to see the relative increase or decrease between the two.

$$
\frac{A - B}{\frac{A + B}{2}}
$$
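As a minimal sketch of this normalisation (Python, standard library only): the function name, the choice of the first row of a pair as A and the second as B, and the convention that a bin with zero counts in both rows maps to 0.0 are all assumptions made for illustration, not part of the original analysis.

```python
def symmetric_relative_change(a, b):
    """Return (a - b) / ((a + b) / 2) for two non-negative counts.

    Assumption: when both counts are zero the ratio is undefined,
    so 0.0 ("no change") is returned instead.
    """
    if a == 0 and b == 0:
        return 0.0
    return (a - b) / ((a + b) / 2)


# The example pair shown above; A = first row, B = second row (assumed).
row_a = [11, 0, 12, 0, 58, 1, 72, 8, 18, 31]
row_b = [23, 0, 6, 0, 54, 8, 70, 11, 21, 18]

# One normalised value per histogram bin.
changes = [symmetric_relative_change(a, b) for a, b in zip(row_a, row_b)]
```

For non-negative counts this quantity always lies between -2 and 2, reaching the extremes when exactly one of the two counts is zero.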

Which resulted in a table that looks something like

[Group 1]
Pair 1     10.00    0.32    0.71    -0.03   -0.07   0.15    -0.53   -0.67   1.56    0.00
...
Pair n     ....


[Group 2]
Pair 1     10.00    0.32    0.71    -0.03   -0.07   0.15    -0.53   -0.67   1.56    0.00
...
Pair n     ....

I then averaged the data in every column to see the average normalised number of members in each category of each group.

That results in something that looks like

[Group 1]
0.00    -0.15   0.28    1.15    0.25    -0.54   -0.08   -0.46   0.25    0.00

[Group 2]
0.00    0.33    0.05    0.26    -0.03   0.03    -0.79   0.16    -0.26   0.00

It is these two distributions, one from each group, that I wish to compare.

EDIT: I hope to see the difference in distribution between the two families of genes.


Best Answer

Kolmogorov-Smirnov and similar goodness-of-fit tests work best for detecting a shift in the bulk of the distribution. As I understand it, the ordering of the genes in your data set does not really matter, i.e. you would like to have equal power for detecting deviations regardless of the gene ordering.

Furthermore, your data is discrete, whereas the Kolmogorov-Smirnov test assumes continuous distributions without ties, so the p-values it returns would be inaccurate.

For testing goodness-of-fit between two discrete categorical distributions, the standard answer is to use the chi-squared test. However, the existence of genes with very few samples means that the chi-squared p-value would be inaccurate: the standard rule of thumb is that each bin needs to have at least ~5 samples (more precisely, an expected count of about 5).
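The rule of thumb is usually applied to the expected counts under the null hypothesis rather than to the raw counts. A rough sketch of that check in plain Python follows; the function names and the `threshold` parameter are illustrative, and the table is the example pair from the question.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def sparse_columns(table, threshold=5.0):
    """Indices of columns where any expected count falls below the threshold."""
    expected = expected_counts(table)
    n_cols = len(table[0])
    return [j for j in range(n_cols)
            if any(row[j] < threshold for row in expected)]


# Example pair from the question, as a 2 x 10 contingency table.
pair = [
    [11, 0, 12, 0, 58, 1, 72, 8, 18, 31],
    [23, 0, 6, 0, 54, 8, 70, 11, 21, 18],
]

# Zero-based indices of bins too sparse for the chi-squared approximation.
low = sparse_columns(pair)
```

Any column flagged this way is a sign that the chi-squared approximation may be unreliable for that table.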

Instead, I suggest using Fisher's Exact test for general n by 2 tables. When exact enumeration of all tables is too expensive, implementations can estimate the p-value by Monte Carlo simulation, which gives a good approximation. Here is an example of an R package implementing the general n by m test.
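To illustrate the simulation idea, here is a hedged sketch of a Monte Carlo, Fisher-type test for a 2 by k table in plain Python. It is not the R package's code: the function names, the default `n_sim`, the fixed seed and the small numerical tolerance are all assumptions. It shuffles the bin labels between the two rows, which keeps both margins fixed, and counts how often a simulated table is no more probable than the observed one.

```python
import math
import random


def log_factorial_sum(table):
    """Sum of log(n!) over all cells.

    With both margins fixed, a table's probability is proportional to
    exp(-log_factorial_sum), so larger values mean less probable tables.
    """
    return sum(math.lgamma(n + 1) for row in table for n in row)


def monte_carlo_fisher(row_a, row_b, n_sim=2000, seed=0):
    """Estimate a Fisher-type p-value for a 2 x k table by simulation."""
    rng = random.Random(seed)
    k = len(row_a)
    col_totals = [row_a[j] + row_b[j] for j in range(k)]
    # One bin label per observation, pooled over both rows.
    labels = [j for j in range(k) for _ in range(col_totals[j])]
    n_a = sum(row_a)
    observed = log_factorial_sum([row_a, row_b])

    hits = 0
    for _ in range(n_sim):
        rng.shuffle(labels)
        # The first n_a shuffled observations form the simulated first row.
        sim_a = [0] * k
        for j in labels[:n_a]:
            sim_a[j] += 1
        sim_b = [col_totals[j] - sim_a[j] for j in range(k)]
        # Count tables that are as probable or less probable than observed.
        if log_factorial_sum([sim_a, sim_b]) >= observed - 1e-9:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (1 + hits) / (1 + n_sim)
```

A call such as `monte_carlo_fisher(wildtype_counts, mutant_counts)` on the two rows of one pair returns an estimated p-value; increasing `n_sim` reduces the Monte Carlo error at the cost of run time.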