Solved – Use of two-sample Kolmogorov-Smirnov test to evaluate similarities between two different distributions

kolmogorov-smirnov test, r

$X_1, X_2, \dots, X_n$ and $Y_1, Y_2, \dots, Y_n$ (with $n = 1000$) are two samples of physical quantities obtained by applying two different mathematical models to some independent and identically distributed (iid) data.

The mathematical model used to generate $Y_i$ is a simplified version (tuned by the parameter $r$) of the mathematical model used to generate $X_i$.

My goal is to find the value of $r$ that makes the empirical distribution $F_Y(y)$ as similar as possible to the distribution $F_X(x)$.

I decided to use the two-sample Kolmogorov-Smirnov test (in R) for different values of $r$ within a specific interval. Is this choice correct?

I know that the null hypothesis for the K-S test is that the two distributions are the same. However, I know for sure that the two distributions are different because the two mathematical models are different. Is it correct to evaluate the best value of $r$ by looking at the p-value and the D statistic coming from the K-S test?

Best Answer

You say you're trying to make the two distributions close; the $D$ statistic measures the discrepancy between the two, and (as long as it's not changing the dimension of the fit) choosing $r$ to minimize that discrepancy makes complete sense.

I don't think there's any need to deal with the $p$-value; $D$ is a sensible thing to optimize.

If you're changing the dimension of the fit (adding or removing parameters), just minimizing $D$ isn't going to be sufficient, since more parameters will always tend to improve the fit.
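The approach in this answer can be sketched as a simple grid search over $r$ that keeps the value with the smallest $D$. This is only an illustration: `full_model`, `simple_model`, the input data `z` and the grid for $r$ are hypothetical stand-ins, since the question does not give the actual models or interval. A grid is used rather than a smooth optimizer because $D$ is a step function of $r$.

```r
set.seed(1)
n <- 1000
z <- rnorm(n)  # hypothetical iid input data (assumption, not from the question)

# Hypothetical stand-ins for the two mathematical models (assumptions)
full_model   <- function(z) exp(0.5 * z)
simple_model <- function(z, r) 1 + r * z

x <- full_model(z)

# Candidate values of the tuning parameter r (assumed interval)
r_grid <- seq(0.1, 2, by = 0.05)

# Two-sample K-S statistic D for each candidate r; the p-value is ignored
d_stat <- sapply(r_grid, function(r) {
  y <- simple_model(z, r)
  unname(ks.test(x, y)$statistic)
})

# The r whose sample Y is closest to X in the K-S sense
r_best <- r_grid[which.min(d_stat)]
r_best
min(d_stat)
```

Plotting `d_stat` against `r_grid` is a quick way to check whether the minimum is sharp or flat.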

Related Solutions

Solved – Kolmogorov-Smirnov two-sample test

I am assuming you are asking because the Suanshu help page reports in reference to the K-S distribution, "This is not done yet." Luckily, it is very easy to do in R. If x and y are your two samples, ks.test(x, y) returns the test statistic and p-value. For example,

> x <- rnorm(50)
> y <- runif(30)
> ks.test(x, y)    
        Two-sample Kolmogorov-Smirnov test    
data:  x and y 
D = 0.5, p-value = 9.065e-05
alternative hypothesis: two-sided

By default, it computes exact or asymptotic p-values depending on the product of the sample sizes (exact p-values for n.x*n.y < 10000 in the two-sample case); you can also set this explicitly with a third argument, exact=FALSE or exact=TRUE. Exact p-values are calculated using the methods of Marsaglia et al. (2003), which the Suanshu documentation also cites. Some large-sample approximations are given here, although I don't have a proper citation. Lastly, if you don't want to install R, there are web calculators for the two-sample K-S test, although I don't know whether they use the same algorithm as R, because the one I found reported only three decimal places for the p-value.
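As a small sketch of what ks.test returns, the statistic and p-value can be pulled out of the result by name, and $D$ can be checked by hand as the largest gap between the two empirical CDFs evaluated at the pooled sample points. The seed and the names `res`, `d_manual` and `res_asym` are illustrative choices, so the numbers will not match the transcript above.

```r
set.seed(42)
x <- rnorm(50)
y <- runif(30)

res <- ks.test(x, y)
res$statistic  # the D statistic
res$p.value    # the p-value (exact here, since 50 * 30 < 10000)

# D by hand: maximum absolute difference between the two empirical CDFs,
# evaluated at every observation in the pooled sample
pooled   <- c(x, y)
d_manual <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))
all.equal(unname(res$statistic), d_manual)

# Requesting the asymptotic p-value changes the p-value but not D
res_asym <- ks.test(x, y, exact = FALSE)
res_asym$p.value
```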

Solved – Kolmogorov-Smirnov two-sample $p$-values

Under the null hypothesis, the asymptotic distribution of the suitably scaled two-sample Kolmogorov–Smirnov statistic, $\sqrt{nm/(n+m)}\,D_{n,m}$ for sample sizes $n$ and $m$, is the Kolmogorov distribution, which has CDF

$$\operatorname{Pr}(K\leq x)=\frac{\sqrt{2\pi}}{x}\sum_{i=1}^\infty e^{-(2i-1)^2\pi^2/(8x^2)} \>.$$

The $p$-values can be calculated from this CDF - see Section 4 and Section 2 of the Wikipedia page on the Kolmogorov–Smirnov test.
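As a rough sketch of that calculation, the series above can be truncated and evaluated directly, and the asymptotic two-sided $p$-value is then one minus the CDF at the scaled statistic $\sqrt{nm/(n+m)}\,D$. The function name `kolmogorov_cdf`, the truncation at 1000 terms and the simulated samples are assumptions made for illustration; the result should be close to what ks.test reports with exact = FALSE, though I have not checked every R version.

```r
# Truncated series for the Kolmogorov CDF, Pr(K <= x)
kolmogorov_cdf <- function(x, terms = 1000) {
  sapply(x, function(xi) {
    if (xi <= 0) return(0)
    i <- seq_len(terms)
    sqrt(2 * pi) / xi * sum(exp(-(2 * i - 1)^2 * pi^2 / (8 * xi^2)))
  })
}

set.seed(7)
x <- rnorm(200)
y <- rnorm(300, mean = 0.2)
n <- length(x)
m <- length(y)

fit <- ks.test(x, y, exact = FALSE)
d   <- unname(fit$statistic)

# Asymptotic p-value from the series, using the scaled statistic
p_series <- 1 - kolmogorov_cdf(sqrt(n * m / (n + m)) * d)

p_series     # from the series
fit$p.value  # from ks.test, for comparison
```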

You seem to be saying that a non-parametric test statistic shouldn't have a distribution. That's not the case: what makes this test non-parametric is that, under the null hypothesis, the distribution of the test statistic does not depend on which continuous probability distribution the original data come from. Note that the KS test has this property even for finite samples, as shown by @cardinal in the comments.
