p-values are random quantities; they're a function of your (random) samples, so naturally they vary from sample to sample.
[Indeed, for a point null hypothesis and a continuous (rather than discrete) test statistic, if $H_0$ were true, the p-values generated in this manner would be uniform on $(0,1)$.]
As the situation moves further from the null in the direction measured by the test statistic, the distribution of p-values skews toward the low end. The p-value is always random (with a long tail toward larger p-values), but it becomes stochastically smaller.
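To see this concretely, here is a small simulation sketch (my own illustration, not part of the answer above): under a true point null the p-values from a t-test come out roughly uniform, while under an alternative they pile up near zero but still vary a lot from sample to sample.

```r
## p-values are random: roughly uniform on (0,1) under H0,
## stochastically smaller (but still spread out) under an alternative
set.seed(42)
p_null <- replicate(10000, t.test(rnorm(30, 0), rnorm(30, 0))$p.value)
p_alt  <- replicate(10000, t.test(rnorm(30, 0), rnorm(30, 0.5))$p.value)
hist(p_null, breaks = 20)  # approximately flat
hist(p_alt,  breaks = 20)  # skewed toward 0, with a long tail of large p-values
```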
So you're simply expecting something that won't happen. The p-values will never be particularly consistent in value; they don't tend to concentrate around some underlying "population" value.
What you're seeing is how hypothesis tests work.
Your results - considered all together - already indicate the null is false (point nulls are rarely true, so this is not much of a surprise). If you can take larger samples, your typical p-values will become smaller. If you can't take larger samples but can run as many hypothesis tests as you wish, you could combine them - for example with Fisher's method, sketched below - and push the overall p-value below any given positive bound.
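A minimal sketch of one standard combining rule, Fisher's method (this assumes the individual tests are independent, which is my assumption, not something stated above):

```r
## Fisher's method: under H0, -2 * sum(log(p_i)) ~ chi-squared with 2k df
## (valid when the k p-values come from independent tests)
p <- c(0.08, 0.12, 0.05, 0.20, 0.11)  # hypothetical p-values, none individually small
k <- length(p)
pchisq(-2 * sum(log(p)), df = 2 * k, lower.tail = FALSE)  # combined p-value, ~0.011 here
```

Note how five unremarkable p-values combine into one that is quite small; with enough tests the combined p-value can be driven arbitrarily close to zero.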
From the little bit of information you give in your description, it doesn't sound like a hypothesis test is a good choice for your situation. If you describe your underlying problem in greater detail (what you're trying to achieve before you get to the point of deciding to use hypothesis tests), it may be that alternatives more suited to your needs could be suggested.
Answering the question 'do they differ' is probably pointless (they do, at least a little -- enough data will tell you this). A more useful question is probably nearer to "are they different enough that it matters?"
This is a great question, and it blows my mind that there is no obvious answer, given that this is essentially the most fundamental statistical comparison scientists make. I came here to ask the exact same question. I don't have a full answer, but I can tell you the inelegant way I'm approaching this problem (a code sketch of the whole procedure follows the steps below).
1) Rather than treating each element as a precise value, construct a probability distribution $P_i(x)$ for each element in your samples. If your errors are approximately normal, this would probably be a Gaussian centered on the measured value. In your case this gives you ~240 probability distributions for each sample.
2) Co-add all the probability distributions in each sample (normalizing by the number of measurements) to create that sample's probability density $D(x)$:
$D(x) = \frac{1}{N} \sum_{i=1}^{N} P_i(x)$, where $N$ is the number of sources in a sample.
Do this for both samples.
3) Use these probability densities to construct a cumulative distribution function for each sample: $\mathrm{CDF}(x) = \int_{-\infty}^{x} D(y)\,dy$.
Do this for both samples.
4) Compare these CDFs as you would in an ordinary KS test: find their maximum difference, $D$.
This $D$ is essentially equivalent to the KS $D$ statistic, but does it translate the same way into a probability of rejecting the null hypothesis? The KS test is theoretically rooted in data whose elements are single precise values, so I'm not sure we can be certain. To get around this theoretical discomfort, we can at least check whether your measured $D$ value is significantly greater than the $D$ values obtained from random permutations of the pooled elements of your two samples.
5) Once you have your "real" $D$ value, go back and randomly shuffle which elements are in sample 1 and which are in sample 2 (keeping the total number of elements in each sample the same as before). Repeat steps 1-4 to get a $D$ value for this randomly assembled pair of samples. Do this a few hundred or a few thousand times and you'll build up a distribution of $D$ values.
6) How does your "real" $D$ value compare to this distribution? If it is greater than 99% (or 95%, or 90%...) of them, that's a good indication your samples' distributions differ significantly more than would be expected if they truly represented the same underlying distribution.
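Here is a rough R sketch of the whole procedure (my own code, not a vetted implementation; the names `smeared_D`, `v1`, `e1`, etc. are made up, and it assumes Gaussian errors and a common evaluation grid):

```r
## Sketch of steps 1)-6): per-element Gaussians -> summed density -> CDF ->
## max CDF difference -> permutation null distribution
smeared_D <- function(v1, e1, v2, e2, grid) {
  # Steps 1)-2): average per-element Gaussian densities to get each sample's D(x)
  dens <- function(v, e) rowMeans(mapply(function(m, s) dnorm(grid, m, s), v, e))
  d1 <- dens(v1, e1); d2 <- dens(v2, e2)
  # Step 3): integrate numerically (cumulative sum times grid spacing) to get CDFs
  dx <- diff(grid)[1]
  cdf1 <- cumsum(d1) * dx; cdf2 <- cumsum(d2) * dx
  # Step 4): maximum absolute difference between the two CDFs
  max(abs(cdf1 - cdf2))
}

# Hypothetical data: two samples of measured values with 1-sigma errors
set.seed(1)
n1 <- 240; n2 <- 240
v1 <- rnorm(n1, 0, 1);   e1 <- runif(n1, 0.1, 0.3)
v2 <- rnorm(n2, 0.2, 1); e2 <- runif(n2, 0.1, 0.3)

grid   <- seq(min(v1, v2) - 3, max(v1, v2) + 3, length.out = 2000)
D_real <- smeared_D(v1, e1, v2, e2, grid)

# Steps 5)-6): shuffle the pooled elements (values together with their errors)
vals <- c(v1, v2); errs <- c(e1, e2)
D_perm <- replicate(1000, {
  idx <- sample(length(vals), n1)
  smeared_D(vals[idx], errs[idx], vals[-idx], errs[-idx], grid)
})
mean(D_perm >= D_real)  # fraction of shuffles beating the real D: a permutation p-value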
Since this is such an important and basic scientific question, part of me assumes that there just MUST be a theoretically-grounded approach to it. So far I haven't found it.
Best Answer
Instead of using the KS test you could simply use a permutation or resampling procedure, as implemented in the `oneway_test` function of the `coin` package. Have a look at the accepted answer to this question.

Update: My package `afex` contains the function `compare.2.vectors`, which implements a permutation test and several other tests for two vectors. You can get it from CRAN. For two vectors `x` and `y` it (currently) returns the results of these tests. Any comments regarding this function are highly welcomed.