Solved – Alternatives to 2-sample Kolmogorov-Smirnov Test

Tags: anderson-darling test, distributions, kolmogorov-smirnov test, p-value, statistical significance

I have 2 samples with numerous repeating numbers like this:

Sample 1: 1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,4,4,4,4,…

Sample 2: 1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,…

and I would like to compare whether their respective distributions differ.

I initially wanted to use the 2-sample Kolmogorov-Smirnov test for this, but the test (especially the R implementation of it) doesn't handle ties, which my data obviously has.
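For illustration, running ks.test on just the values listed above (the "…" tails omitted, since I only showed the beginnings) produces a ties warning:

    s1 <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3, 4,4,4,4)
    s2 <- c(1,1,1, 2,2,2, 3,3,3,3,3,3,3,3,3,3, 4,4,4,4,4)
    ks.test(s1, s2)
    # Warning message:
    # cannot compute exact p-value with ties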

Would the k-sample Anderson-Darling test be a sufficient alternative to the 2-sample KS test? What other alternatives would you recommend (like a chi-squared test on the two-way contingency table, maybe)?

EDIT: If my 2 samples come from 2 different populations and the corresponding p-value is less than 0.05, then I can reject the null hypothesis (which states that all samples come from a common population). Would the correct conclusion be that the populations from which the samples are derived are different?

Best Answer

You can still use the Kolmogorov-Smirnov test and get the critical values from the permutation distribution of the test statistic. Another approach is the chi-squared test.
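As a rough sketch of the permutation route (the helper perm_ks and the choice of B = 2000 are mine, purely for illustration): the ties only break the exact p-value computation, not the D statistic itself, so you can recompute D under random relabelings of the pooled data and read the p-value off that permutation distribution.

    # Permutation two-sample KS test: compare the observed D statistic
    # against D recomputed under random splits of the pooled sample.
    perm_ks <- function(x, y, B = 2000) {
      pooled <- c(x, y)
      n <- length(x)
      obs <- suppressWarnings(ks.test(x, y)$statistic)
      perm <- replicate(B, {
        idx <- sample(length(pooled), n)
        suppressWarnings(ks.test(pooled[idx], pooled[-idx])$statistic)
      })
      # add-one correction keeps the estimated p-value away from exactly 0
      (1 + sum(perm >= obs)) / (1 + B)
    }

Calling perm_ks(s1, s2) on your two vectors then returns the permutation p-value directly.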

Here is a toy example that may be suitable for your application:

set.seed(5)
dta <- data.frame(group   = c(rep(1, 50), rep(2, 50)),
                  outcome = c(sample(1:5, 50, replace = TRUE),
                              sample(1:5, 50, replace = TRUE)))

Then the chi-square test is simply

# Chi-squared test of homogeneity on the 2 x 5 group-by-outcome table;
# simulate.p.value = TRUE gives a Monte Carlo p-value, safer for sparse tables
chisq.test(table(dta$group, dta$outcome), simulate.p.value = TRUE)

Observe that I purposely generated both samples using the same process, hence we fail to reject the null.

One last thing. If you still want a different method, consider the following: if two distributions follow different laws, then they must differ in at least one aspect, for instance the variance. Therefore, if you reject the null hypothesis of, say, equal variances, you can infer that the distributions are different.

For tests of variances when the distributions are not continuous (or not normally distributed), you can use robust permutation tests; see the R package RATest.
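If you'd rather avoid a dependency, the same permutation logic can be hand-rolled for variances. This is only a sketch (perm_var and B = 2000 are illustrative choices of mine), and note the caveat: a plain label permutation is exact only when the null is "identical distributions", not merely "equal variances"; handling the latter is exactly what the robust tests in RATest are for.

    # Plain permutation test for a difference in variances: shuffle the
    # pooled observations into two groups and recompute the statistic.
    perm_var <- function(x, y, B = 2000) {
      pooled <- c(x, y)
      n <- length(x)
      obs <- abs(var(x) - var(y))
      perm <- replicate(B, {
        idx <- sample(length(pooled), n)
        abs(var(pooled[idx]) - var(pooled[-idx]))
      })
      (1 + sum(perm >= obs)) / (1 + B)
    }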

Of course this approach has several drawbacks: failing to reject the null leaves you where you started, and you may then have to test for differences in other parameters, and so on. It is not the smartest move, but it often works quite well.

I hope this helps.
