That is OK, and quite reasonable. It is referred to as the two-sample Kolmogorov-Smirnov test. Measuring the difference between two distribution functions by the sup norm is always sensible, but to do a formal test you want to know the distribution of that statistic under the hypothesis that the two samples are independent and each i.i.d. from the same underlying distribution. To rely on the usual asymptotic theory you will need continuity of the underlying common distribution (not of the empirical distributions). See the Wikipedia page linked to above for more details.
In R, you can use `ks.test`, which computes exact $p$-values for small sample sizes.
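For readers working in Python rather than R, `scipy.stats.ks_2samp` is the closest analogue (a sketch, not part of the original answer; its default method uses the exact null distribution of $D$ when the samples are small, falling back to the asymptotic one otherwise):

```python
import numpy as np
from scipy.stats import ks_2samp

# Two small illustrative samples (made-up values).
x1 = np.array([1.1, 2.3, 2.9, 4.2, 5.0])
x2 = np.array([1.8, 3.1, 3.5, 4.9, 6.2])

# For samples this small, ks_2samp's default method is exact.
res = ks_2samp(x1, x2)
print(res.statistic, res.pvalue)
```

The R call `ks.test(x1, x2)` would report the same $D$ statistic for these data.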
This is a great question, and it blows my mind that there is not an obvious answer, given that this is essentially the most fundamental statistical comparison scientists make. I came here to ask the exact same question. I don't have a full answer, but I can tell you the inelegant way I'm approaching this problem.
1) Rather than treating each element as a precise value, construct a probability distribution $P_i(x)$ for each element in your samples. If your errors are approximately normal, this would probably be a Gaussian distribution centered on your measured value. In your case this gives you ~240 probability distributions per sample.
2) Co-add all the probability distributions in each sample (normalizing by the number of measurements) to create the sample's probability density $D(x)$:
$$D(x) = \frac{1}{N}\sum_{i=1}^{N} P_i(x),$$
where $N$ is the number of sources in a sample.
Do this for both samples.
3) Use the probability densities to form a cumulative distribution function for each sample:
$$\mathrm{CDF}(x) = \int_{-\infty}^{x} D(y)\,dy$$
Do this for both samples.
4) Compare these CDFs as you would in a normal KS test. Find their max difference, D.
This D is essentially equivalent to the KS D statistic, but does it translate the same way into a probability of rejecting the null hypothesis? I think the KS test is theoretically rooted in data with single values, so I'm not sure we can be certain. To get around this theoretical discomfort, we can at least check whether your measured D value is significantly greater than the D for random permutations of the elements across your two samples.
5) Once you have your "real" D value, go back and randomly shuffle which elements are in sample 1 and which are in sample 2 (but keep the total number of elements in each sample the same as before). Repeat steps 1-4 to come up with a D value for this randomly assembled comparison of samples. Do this a few hundred or thousand times and you'll come up with a distribution of D values.
6) How does your "real" D value compare to this distribution? If it is greater than 99% (or 95% or 90%...) of them, that's a good indication your samples' distributions differ significantly more than would be expected if they truly represented the same underlying distribution.
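The procedure in steps 1-6 can be sketched in Python (the function names, the Gaussian error model, and the evaluation grid are my own choices for illustration, not from the original post):

```python
import numpy as np
from scipy.stats import norm

def smoothed_cdf(values, sigmas, grid):
    # Steps 1-3: each measurement contributes a Gaussian CDF centered on its
    # value; averaging the N of them gives the sample's error-convolved CDF.
    return norm.cdf(grid[None, :], loc=values[:, None], scale=sigmas[:, None]).mean(axis=0)

def smoothed_D(x1, s1, x2, s2, grid):
    # Step 4: maximum vertical distance between the two smoothed CDFs.
    return np.max(np.abs(smoothed_cdf(x1, s1, grid) - smoothed_cdf(x2, s2, grid)))

def permutation_test(x1, s1, x2, s2, n_perm=500, seed=1):
    # Steps 5-6: shuffle which elements belong to which sample (keeping the
    # sample sizes fixed) and rebuild D each time.
    rng = np.random.default_rng(seed)
    vals, sigs = np.concatenate([x1, x2]), np.concatenate([s1, s2])
    grid = np.linspace(vals.min() - 5 * sigs.max(), vals.max() + 5 * sigs.max(), 512)
    d_obs, n1 = smoothed_D(x1, s1, x2, s2, grid), len(x1)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(vals))
        a, b = idx[:n1], idx[n1:]
        if smoothed_D(vals[a], sigs[a], vals[b], sigs[b], grid) >= d_obs:
            exceed += 1
    # Fraction of shuffled datasets with a D at least as large as the real one.
    return d_obs, (exceed + 1) / (n_perm + 1)
```

The returned fraction is a permutation $p$-value: if it is below 0.01 (or 0.05, 0.10, ...), your "real" D exceeds what random relabeling produces.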
Since this is such an important and basic scientific question, part of me assumes that there just MUST be a theoretically-grounded approach to it. So far I haven't found it.
Best Answer
I can't retrieve your data from your histograms. So let's look at largish datasets from a couple of slightly different distributions and see if the K-S two-sample test can discern that the two samples aren't from the same distribution. [I'm using R.]
Here are histograms of the two samples, each with the density function of its population shown for reference.
Somewhat similar, but not exactly the same. For example, $\mu_1 = 11/2 = 5.5$ and $\mu_2 = 12/2 = 6.0.$ Furthermore, the K-S test rejects the null hypothesis that the two samples came from the same distribution. K-S tests aren't exactly famous for their good power, but with $n=1000$ observations from each sample, the test was able to reject with P-value very near $0.$
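The answer's R simulation code isn't shown here; the following Python sketch reproduces the idea with two gamma populations whose means match the quoted values (shapes 11 and 12 with rate 2 are an assumption on my part, chosen only so that the means come out to 5.5 and 6.0):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2023)
# Two slightly different populations: with rate 2 (scale 0.5), the
# means are 11/2 = 5.5 and 12/2 = 6.0 (assumed parameters).
x1 = rng.gamma(shape=11, scale=0.5, size=1000)
x2 = rng.gamma(shape=12, scale=0.5, size=1000)

# With n = 1000 per sample, the test rejects at conventional levels.
D, p = ks_2samp(x1, x2)
print(f"D = {D:.3f}, p = {p:.2e}")
```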
The test statistic $D$ of the K-S test is the maximum vertical distance between the empirical CDFs (ECDFs) of the samples.
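This definition can be checked directly (a sketch: both ECDFs are step functions that are flat between observations, so it suffices to evaluate them at every observed point and take the largest gap):

```python
import numpy as np
from scipy.stats import ks_2samp

def ecdf_D(x1, x2):
    # Right-continuous ECDFs evaluated at all observed points; between
    # observations both ECDFs are constant, so the sup is attained here.
    pts = np.concatenate([x1, x2])
    F1 = np.searchsorted(np.sort(x1), pts, side="right") / len(x1)
    F2 = np.searchsorted(np.sort(x2), pts, side="right") / len(x2)
    return np.max(np.abs(F1 - F2))

x1 = np.array([1.0, 2.0, 4.0, 5.0])
x2 = np.array([3.0, 3.5, 6.0, 7.0])
print(ecdf_D(x1, x2), ks_2samp(x1, x2).statistic)  # the two agree
```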
As seen in the ECDF plots, `x2` (brown) stochastically dominates `x1` (blue) because the former plot lies consistently to the right of the latter. Because the shapes of the two distributions aren't exactly the same, some might say a two-sample Wilcoxon test is not entirely appropriate. I would not want to claim that the Wilcoxon test finds the median of `x2` to be larger than the median of `x1`, but the Wilcoxon test does find a difference between the two samples.