Solved – Statistical description of how two distributions are different

distributionskolmogorov-smirnov test

I am have a collection of events recorded in two separate conditions, and I have measured the duration of each one of them. I would like to know two things: 1) are the distributions of durations different between the two conditions, and 2) if so, how are they different? For example, maybe the biggest difference between the two conditions is that there are far more 10 ms long events in condition #1 than in condition #2.

I feel confident in using a Kolmogorov-Smirnov test to determine whether the distributions are significantly different, but am stuck on how to describe what the differences actually are using a statistical test or collection of test statistics. I could histogram the data and run some sort of ANOVA, but that seems a little course and inelegant. Or maybe it is fair to plot the distributions on top of each other and simply read off the difference at areas of interest. Any advice on how to tackle this problem?

Below is some example data. Here a Kolmogorov-Smirnov test comparing these two distributions returns a very small p-value (p < 0.000001), so I know that the two distributions come from different populations. Based on how I think the KS test works, this also tells me that where the two distributions are widest apart in the horizontal plane (around 0.9 on the y-axis) the two distributions are significantly different (there are more events with durations around 60 ms in group 1 than in group 2). But what about everywhere else in the distribution? Is there a way for me to know whether the difference in the amount of events that are 10 ms long is also statistically significant?

enter image description here

Best Answer

From a descriptive standpoint, a QQ plot is about as good as it gets for something like this, you can really see where in the distributions things are different (heavier tails, skew, location shift, outliers). That might give you insight about what parameter of the distributions you are interested in testing.

The issue with a statistical test is that it demands that you really clearly state what you care about, i.e., your null hypothesis. As an example, if you wanted to compare whether or not the mean of two distributions was different, you might not care that their standard deviations are different:

a = rnorm(100, sd = 3); b = rnorm(100, sd = 1))
> t.test(a,b)$p.val
[1] 0.2825168

> ks.test(a,b)$p.val
[1] 7.14257e-05

If our null hypothesis is that the means are different, it wouldn't be correct to test with the KS-test and we see we'd incorrectly reject the null. In your case, you have to state, what aspect of the distribution you care about and then test accordingly.

It would be fine to ask, Is the 10th-percentile different -- that's a good null hypothesis. I'm not aware of well-known statistical test for such a thing, but you could certainly use a bootstrap-based approach.