Solved – Two-sample Kolmogorov-Smirnov test with errors on data points

distributions, error, kolmogorov-smirnov test, sample, skewness

Short version

I want to test whether two samples, which follow a skewed distribution, can be distinguished from each other. A two-sample Kolmogorov-Smirnov statistic would be sufficient if there were a way to include errors. Each data point I have carries an error measurement, and these errors are relatively large and vary widely from point to point.

Long version

In the following figures, I plot the histograms of two data sets, each containing two samples, a red sample and a blue sample. The vertical line denotes the mean. As is visible, in both cases the two sample distributions are skewed relative to each other. There are roughly 240 data points in each histogram for both data sets, and the red and blue samples within a data set are of equal size. The data are restricted to the range [-0.2, 0.4], and the average error on each data point is 0.12 (welcome to my world!).

First histogram and means of the two sample distributions
Second histogram and means of the two sample distributions

A two-sample Kolmogorov-Smirnov test gives a p-value of 6e-79 with a KS statistic of 0.85 for the first data set, and p = 6e-40 with a KS statistic of 0.57 for the second (calculated with scipy.stats.ks_2samp). As you can guess from the size of the average error bar, that sounds too good to be true.
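For reference, the standard test as used above takes only the point values; the samples below are synthetic stand-ins (the real data are not available), and the key limitation is visible in the API itself: ks_2samp has no argument for per-point errors.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
red = rng.normal(0.05, 0.08, 240)   # hypothetical stand-in for the red sample
blue = rng.normal(0.15, 0.08, 240)  # hypothetical stand-in for the blue sample

# ks_2samp sees only the measured values; the 0.12 measurement errors
# are invisible to it, which is exactly the problem described above.
stat, p = ks_2samp(red, blue)
print(stat, p)
```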

Searching the internet for a weighted Kolmogorov-Smirnov test that takes errors into account unfortunately yields nothing. That is, imho, what I need, unless I can get some statistical reassurance that I do not need to worry about the data point errors in my case.

Best Answer

This is a great question, and it blows my mind that there is not an obvious answer, given that this is essentially the most fundamental statistical comparison scientists make. I came here to ask the exact same question. I don't have a full answer, but I can tell you the inelegant way I'm approaching this problem.

1) Rather than treating each element as a precise value, construct a probability distribution P_i(x) for each element in your samples. If your errors are approximately normal, this would be a Gaussian centered on the measured value, with the point's error as its standard deviation. In your case this gives you ~240 different probability distributions for each sample.

2) Co-add all the probability distributions in each sample, normalizing by the number of measurements, to create the total sample's probability density D(x): D(x) = (1/N) * SUM[P_i(x), i = 1..N], where N is the number of sources in a sample. Do this for both samples.

3) Integrate each sample's probability density to obtain its cumulative distribution function: CDF(x) = Integral[D(y) dy, y = -infinity..x]. Do this for both samples.

4) Compare these CDFs as you would in a normal KS test: find the maximum absolute difference between them, D.
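Steps 1-4 can be sketched as follows, assuming Gaussian errors. Everything here (the smeared_cdf helper, the grid, the toy values) is illustrative, not taken from the question. One convenient shortcut: because integration is linear, the CDF of the co-added density is just the mean of the individual Gaussian CDFs, so step 3 needs no numerical integration.

```python
import numpy as np
from scipy.stats import norm

def smeared_cdf(values, errors, grid):
    """Steps 1-3: CDF of the error-smeared sample distribution.

    Each point contributes a Gaussian P_i centered on its value with
    its own sigma; the CDF of D(x) = (1/N) * sum_i P_i(x) is simply
    the mean of the individual Gaussian CDFs."""
    v = np.asarray(values, dtype=float)[:, None]
    e = np.asarray(errors, dtype=float)[:, None]
    return norm.cdf(np.asarray(grid)[None, :], loc=v, scale=e).mean(axis=0)

# Step 4 on toy data: compare the two smeared CDFs on a shared grid
# chosen to generously cover the stated range [-0.2, 0.4] plus the errors.
grid = np.linspace(-1.0, 1.2, 2001)
red_cdf = smeared_cdf([0.00, 0.05, 0.10], [0.12, 0.10, 0.15], grid)
blue_cdf = smeared_cdf([0.15, 0.20, 0.25], [0.12, 0.14, 0.11], grid)
D = np.max(np.abs(red_cdf - blue_cdf))
```

The grid just needs to extend a few sigma beyond the data on each side so the CDFs reach 0 and 1 within it.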

This D is essentially equivalent to the KS D statistic, but does it translate the same way into a probability of rejecting the null hypothesis? I think the KS test is theoretically rooted in data with single values, so I'm not sure we can be certain. To get around this theoretical discomfort, we can at least check whether your measured D value is significantly greater than that of any random permutation of samples composed of all the elements in your two samples.

5) Once you have your "real" D value, go back and randomly shuffle which elements are in sample 1 and which are in sample 2, keeping each element's value and error paired and keeping the sample sizes the same as before. Repeat steps 1-4 to get a D value for this randomly assembled pair of samples. Do this a few hundred or thousand times and you'll build up a distribution of D values.

6) How does your "real" D value compare to this distribution? If it is greater than 99% (or 95%, or 90%...) of them, that's a good indication that your samples' distributions differ significantly more than would be expected if they were drawn from the same underlying distribution.
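Steps 5-6 together amount to a permutation test on the smeared statistic. A self-contained sketch, again assuming Gaussian errors; the function names and toy data are mine, and the p-value uses the usual add-one convention so it can never be exactly zero:

```python
import numpy as np
from scipy.stats import norm

def smeared_cdf(values, errors, grid):
    # Steps 1-3: mean of the per-point Gaussian CDFs.
    v = np.asarray(values, dtype=float)[:, None]
    e = np.asarray(errors, dtype=float)[:, None]
    return norm.cdf(np.asarray(grid)[None, :], loc=v, scale=e).mean(axis=0)

def smeared_D(x1, e1, x2, e2, grid):
    # Step 4: maximum absolute difference between the smeared CDFs.
    return np.max(np.abs(smeared_cdf(x1, e1, grid) - smeared_cdf(x2, e2, grid)))

def permutation_pvalue(x1, e1, x2, e2, n_perm=1000, seed=0):
    """Steps 5-6: shuffle which (value, error) pairs belong to which
    sample, keeping the sample sizes fixed, and count how often a
    random split yields a D at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-1.0, 1.2, 1001)
    d_obs = smeared_D(x1, e1, x2, e2, grid)
    x = np.concatenate([x1, x2])
    e = np.concatenate([e1, e2])
    n1 = len(x1)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(x))
        d = smeared_D(x[idx[:n1]], e[idx[:n1]], x[idx[n1:]], e[idx[n1:]], grid)
        if d >= d_obs:
            count += 1
    # Add-one estimate: includes the observed split, so p is never 0.
    return d_obs, (count + 1) / (n_perm + 1)

# Toy usage with made-up values and a common 0.12 error:
x_red = np.array([0.00, 0.05, 0.10, 0.12, -0.05])
x_blue = np.array([0.20, 0.25, 0.30, 0.18, 0.22])
errs = np.full(5, 0.12)
d_obs, p_perm = permutation_pvalue(x_red, errs, x_blue, errs, n_perm=200)
```

Shuffling the (value, error) pairs as units is what makes the permutation null honest: each pseudo-sample has the same mix of error sizes as the real data.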

Since this is such an important and basic scientific question, part of me assumes that there just MUST be a theoretically-grounded approach to it. So far I haven't found it.