Solved – Alternatives to 2-sample Kolmogorov-Smirnov Test

Tags: anderson-darling test, distributions, kolmogorov-smirnov test, p-value, statistical significance

I have 2 samples with numerous repeating numbers like this:

Sample 1: 1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,4,4,4,4,…

Sample 2: 1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,…

and I would like to compare whether their respective distributions differ.

I initially wanted to use the 2-sample Kolmogorov-Smirnov test for this, but the test (especially the R implementation of it) doesn't handle ties, which my data obviously has.
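For illustration, running ks.test on just the values listed above (the "…" tails omitted, since I only showed the beginnings) produces a ties warning:

    s1 <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3, 4,4,4,4)
    s2 <- c(1,1,1, 2,2,2, 3,3,3,3,3,3,3,3,3,3, 4,4,4,4,4)
    ks.test(s1, s2)
    # Warning message:
    # cannot compute exact p-value with ties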

Would the k-sample Anderson-Darling test be a sufficient alternative to the 2-sample KS test? What other alternatives would you recommend (like a chi-squared test on the two-way contingency table, maybe)?

EDIT: If my 2 samples come from 2 different populations and the corresponding p-value is less than 0.05, then I can reject the null hypothesis (which states that all samples come from a common population). Would the correct conclusion be that the populations from which the samples are derived are different?

Best Answer

You can still use the Kolmogorov-Smirnov test and get the critical values from the permutation distribution of the test statistic. Another approach is the chi-squared test.
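As a rough sketch of the permutation route (the helper perm_ks and the choice of B = 2000 are mine, purely for illustration): the ties only break the exact p-value computation, not the D statistic itself, so you can recompute D under random relabelings of the pooled data and read the p-value off that permutation distribution.

    # Permutation two-sample KS test: compare the observed D statistic
    # against D recomputed under random splits of the pooled sample.
    perm_ks <- function(x, y, B = 2000) {
      pooled <- c(x, y)
      n <- length(x)
      obs <- suppressWarnings(ks.test(x, y)$statistic)
      perm <- replicate(B, {
        idx <- sample(length(pooled), n)
        suppressWarnings(ks.test(pooled[idx], pooled[-idx])$statistic)
      })
      # add-one correction keeps the estimated p-value away from exactly 0
      (1 + sum(perm >= obs)) / (1 + B)
    }

Calling perm_ks(s1, s2) on your two vectors then returns the permutation p-value directly.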

Here is a toy example that may be suitable for your application:

set.seed(5)
dta <- data.frame(group   = c(rep(1, 50), rep(2, 50)),
                  outcome = c(sample(1:5, 50, replace = TRUE),
                              sample(1:5, 50, replace = TRUE)))

Then the chi-square test is simply

# Chi-squared test of homogeneity on the 2 x 5 group-by-outcome table;
# simulate.p.value = TRUE gives a Monte Carlo p-value, safer for sparse tables
chisq.test(table(dta$group, dta$outcome), simulate.p.value = TRUE)

Observe that I purposely generated both samples using the same process, hence we fail to reject the null.

One last thing. If you still want a different method, consider the following: if two distributions follow different laws, then they must differ in at least one aspect, for instance the variance. Therefore, if you reject the null hypothesis of, say, equal variances, you can infer that the distributions are different.

For tests of variances when the distributions are not continuous (or not normally distributed), you can use robust permutation tests; see the R package RATest.
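If you'd rather avoid a dependency, the same permutation logic can be hand-rolled for variances. This is only a sketch (perm_var and B = 2000 are illustrative choices of mine), and note the caveat: a plain label permutation is exact only when the null is "identical distributions", not merely "equal variances"; handling the latter is exactly what the robust tests in RATest are for.

    # Plain permutation test for a difference in variances: shuffle the
    # pooled observations into two groups and recompute the statistic.
    perm_var <- function(x, y, B = 2000) {
      pooled <- c(x, y)
      n <- length(x)
      obs <- abs(var(x) - var(y))
      perm <- replicate(B, {
        idx <- sample(length(pooled), n)
        abs(var(pooled[idx]) - var(pooled[-idx]))
      })
      (1 + sum(perm >= obs)) / (1 + B)
    }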

Of course this approach has several drawbacks: failing to reject the null leaves you where you started, and you may then have to test for differences in other parameters, and so on. It is not the smartest move, but it often works quite well.

I hope this helps.
