Solved – Alternative to Kolmogorov-Smirnov test when parameters are estimated from the data

discrete data, hypothesis testing, kolmogorov-smirnov test, nonparametric

I need to compare whether two distributions are similar when the values in each are scaled by that distribution's mean. One limitation of the KS test, noted at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm, is that "if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid."

Consider for example:

Data1 consists of 10000 numbers drawn from a uniform distribution on [0,1], with sample mean 0.5.

Data2 consists of 10000 numbers drawn from a uniform distribution on [0,10], with sample mean 5.001.

If I compare Data1 with Data2/10, the KS test concludes that the two distributions are the same, whereas comparing Data1/0.5 with Data2/5.001 concludes that the distributions are different. Is there a way to check the similarity between the distributions in such cases?
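
For concreteness, here is a minimal sketch of the two comparisons described above, assuming Python with NumPy and SciPy (the seed and variable names are illustrative); the exact statistics and p-values will depend on the realized random samples.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    data1 = rng.uniform(0, 1, 10000)   # sample mean close to 0.5
    data2 = rng.uniform(0, 10, 10000)  # sample mean close to 5

    # Comparison using the known scale factor of 10.
    print(ks_2samp(data1, data2 / 10))

    # Comparison after dividing each sample by its own estimated mean;
    # here the standard KS critical values are no longer strictly valid.
    print(ks_2samp(data1 / data1.mean(), data2 / data2.mean()))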

Edit:
As the answer suggests, I can use the KS statistic with a p-value determined via permutation.

My additional difficulty is that the data points are integers:

Data1 consists of 10000 integers drawn uniformly from [0,10], with sample mean 5.

Data2 consists of 10000 integers drawn uniformly from [0,100], with sample mean 50.001.

Is there a test to compare whether Data1 and Data2 are similar apart from scale? Further, I do not know the actual scale; I am estimating it from the data.

These examples are just a proxy for my actual data, which come from two experiments: in one, 10000 people rated a movie on a scale of [0,10], while in the other, 10000 different people rated the same movie on a scale of [0,100]. I want to check whether, apart from the scale, the distributions can be said to be the same or not.

Best Answer

One option is to still use the KS test statistic, but instead of using the standard p-value from the KS test (which, as you say, is not appropriate when parameters are estimated from the data), calculate the p-value using a permutation test. The basic steps (a code sketch follows the list) would be:

1. Calculate the KS test statistic for the data as is (each sample divided by its estimated mean).

2. Combine the two (already divided) datasets, randomly split them into two sets of 10,000 (or whatever the original sample sizes were), and compute the KS statistic for these new "samples".

3. Repeat the previous step many times (e.g., 999 or 9,999 times).

4. The p-value is the proportion of test statistics that are as extreme as or more extreme than your original test statistic.
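
A minimal sketch of this procedure, assuming Python with NumPy and SciPy (the function name, seed, and defaults below are illustrative, not part of the original answer):

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)

    def ks_permutation_pvalue(data1, data2, n_perm=999):
        # Scale each sample by its own estimated mean, as in the question.
        x = np.asarray(data1) / np.mean(data1)
        y = np.asarray(data2) / np.mean(data2)

        # Step 1: KS statistic for the rescaled data as observed.
        observed = ks_2samp(x, y).statistic

        # Steps 2-3: pool the rescaled samples, shuffle, re-split, recompute.
        pooled = np.concatenate([x, y])
        n = len(x)
        count = 0
        for _ in range(n_perm):
            perm = rng.permutation(pooled)
            count += ks_2samp(perm[:n], perm[n:]).statistic >= observed

        # Step 4: proportion of statistics at least as extreme as the observed one
        # (counting the observed statistic itself, so the p-value is never zero).
        return (count + 1) / (n_perm + 1)

    # Example with the uniform data from the question.
    data1 = rng.uniform(0, 1, 10000)
    data2 = rng.uniform(0, 10, 10000)
    print(ks_permutation_pvalue(data1, data2))

Adding one to both the numerator and the denominator is the usual convention for permutation p-values; dropping it recovers the bare proportion described in step 4.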
