I need to compare whether two distributions are similar when the values of each are scaled by that distribution's mean. One limitation of the KS test, as per http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm, is that "if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid."
Consider for example:
Data1 consists of 10000 numbers from a uniform random distribution on [0,1] with mean 0.5
Data2 consists of 10000 numbers from a uniform random distribution on [0,10] with mean 5.001
If I compare Data1 with Data2/10, the KS test indicates that both distributions are the same, while comparing Data1/0.5 with Data2/5.001 indicates that the distributions are different. Is there a way to check the similarity between the distributions in such cases?
Edit:
As the answer suggests, I can use the KS test with the p-value determined via permutation.
My additional difficulty is that the data points are integers:
Data1 consists of 10000 integers from a uniform random distribution on [0,10] with mean 5
Data2 consists of 10000 integers from a uniform random distribution on [0,100] with mean 50.001
Is there a test to compare whether Data1 and Data2 are similar apart from the scale? Furthermore, I do not know the actual scale and am estimating it from the data.
These examples are just a proxy for my actual data, which come from two experiments: in one, 10000 people rated a movie on a scale of [0,10], while in the other, 10000 different people rated the same movie on a scale of [0,100]. I want to check whether, apart from the scale, the two distributions can be said to be the same.
Best Answer
One option is to still use the KS test statistic, but instead of using the standard p-value from the KS test (which, as you say, is not appropriate when parameters are estimated from the data), calculate the p-value using a permutation test. The basic steps would be:
Calculate the KS test statistic for the data as is (each sample divided by its estimate).
Now combine the 2 datasets (already divided) and randomly split them into 2 sets of 10,000 (or whatever the original sample sizes were) and compute the KS test statistic for these new "samples".
Repeat the above many times (999, or 9,999).
The p-value is the proportion of test statistics that are as extreme as or more extreme than your original test statistic.
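The steps above can be sketched in Python roughly as follows. This is only a minimal illustration using the standard library, not a definitive implementation: the function names `ks_statistic` and `permutation_ks_pvalue`, the `n_perm` and `seed` parameters, and the choice to scale each sample by its own sample mean are all assumptions made for the example. It also assumes the original statistic is counted as one of the permutations, which is why the p-value is `(hits + 1) / (n_perm + 1)`; that convention is what makes 999 or 9,999 repetitions give round denominators.

```python
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        v = min(a[i], b[j])
        # Advance both ECDFs past every value equal to v, so ties
        # (e.g. integer ratings) are handled together.
        while i < na and a[i] <= v:
            i += 1
        while j < nb and b[j] <= v:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d


def permutation_ks_pvalue(x, y, n_perm=999, seed=None):
    """Return (observed KS statistic, permutation p-value) for mean-scaled samples."""
    rng = random.Random(seed)

    # Step 1: divide each sample by its own estimated mean (assumed scale
    # estimate), then compute the KS statistic on the scaled data.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    xs = [v / mean_x for v in x]
    ys = [v / mean_y for v in y]
    observed = ks_statistic(xs, ys)

    # Steps 2 and 3: pool the already-scaled values, reshuffle, split into
    # two groups of the original sizes, and recompute the statistic.
    pooled = xs + ys
    n = len(xs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n], pooled[n:]) >= observed:
            hits += 1

    # Step 4: proportion of statistics at least as extreme as the observed
    # one, counting the observed statistic itself as one permutation.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    gen = random.Random(42)
    data1 = [gen.randint(0, 10) for _ in range(1000)]
    data2 = [gen.randint(0, 100) for _ in range(1000)]
    stat, p = permutation_ks_pvalue(data1, data2, n_perm=199, seed=0)
    print(f"KS statistic = {stat:.4f}, permutation p-value = {p:.4f}")
```

The usage at the bottom uses smaller samples and fewer permutations than in the question only to keep the pure-Python run time short; with 10,000 points per sample and 999 permutations, a vectorised version (for example with NumPy) would likely be preferable.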