Solved – Interpreting Two-sample Kolmogorov-Smirnov with jerzy

interpretation, kolmogorov-smirnov test

I am using the project jerzy to run a two-sample Kolmogorov-Smirnov test in JavaScript, in connection with another question I asked on stats.SE: Timing attacks: When the time to complete two different tasks are statistically indistinguishable.

Not knowing anything about statistics, I would be grateful for some assistance interpreting the results. The code for the test being run is on Github (pieterprovoost/jerzy).

I am in particular running the test like this (and the numbers are contrived):

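 var diff = [750, 740, 790]; // array of nanosecond results
 var equal = [750, 610, 960];
 var results = {}; // container for the jerzy vectors
 results.diff = new jerzy.Vector(diff);
 results.equal = new jerzy.Vector(equal);
 // pass the Vector objects, not the raw arrays
 var ks = new jerzy.Nonparametric.kolmogorovSmirnov(results.diff, results.equal);
 console.log(ks); // object with d, ks and p attributes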

The output I get from the real data is something along the lines of:

 { d: 0.032657926102502954,
   ks: 0.6660700343005954,
   p: 0.7667168595417211 }

In the real tests the diff and equal are arrays of nanosecond timings. I would like to establish with some confidence that the arrays are effectively from the same distribution, with a difference of around 15ns.

How would one interpret the above result of the kolmogorovSmirnov function of jerzy, in terms of how strongly one might state the probability and confidence that the two arrays are from the same distribution?

Best Answer

The null hypothesis of the two-sample Kolmogorov–Smirnov test is that the two datasets come from the same distribution. The test tries to reject the null hypothesis in favor of the alternative (that the distributions differ); if it fails to do so, the null hypothesis is simply not rejected, which is not the same as showing it to be true.

The decision (to reject the null hypothesis) is based on the p-value computed for the given data; this is what the $p$ attribute of your JavaScript object stands for. Before performing the test, one typically decides on the significance level suitable for the problem at hand; this level is conventionally denoted by $\alpha$ and quite often chosen to be 0.05. Then the null hypothesis is rejected if

$$ \alpha \geq p. $$

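So, in your case, since $p \approx 0.77$ is well above 0.05, the test has failed to reject the null hypothesis at significance level 0.05. This is not evidence that the two distributions are identical: with few points (the contrived example has only three in each dataset) the test has little power, and it might need more points to detect a real difference.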
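The decision rule described above can be sketched in a few lines of JavaScript. This is only an illustration: `interpretKs` is a hypothetical helper, not part of jerzy, and the default `alpha` of 0.05 is just the conventional choice mentioned above.

```javascript
// Hypothetical helper (not part of jerzy): applies the rule "reject if p <= alpha"
// to an object shaped like the result printed in the question.
function interpretKs(result, alpha = 0.05) {
  const rejectNull = result.p <= alpha;
  const summary = rejectNull
    ? `p = ${result.p.toFixed(4)} <= ${alpha}: reject the null hypothesis`
    : `p = ${result.p.toFixed(4)} > ${alpha}: fail to reject the null hypothesis`;
  return { rejectNull, summary };
}

// Values copied from the output quoted in the question.
const outcome = interpretKs({
  d: 0.032657926102502954,
  ks: 0.6660700343005954,
  p: 0.7667168595417211
});
console.log(outcome.summary);
```

Note that a `rejectNull` of `false` only means the data did not contradict the null hypothesis at the chosen level; it does not quantify how similar the two distributions are.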

The $d$ attribute provided by jerzy is the uniform distance (maximal pointwise distance) between the empirical CDFs computed for the two datasets, and $ks$ is $d$ scaled by a sample-size factor, conventionally $\sqrt{nm/(n+m)}$ for samples of sizes $n$ and $m$, so that it can be compared against the Kolmogorov distribution.
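To make the two quantities concrete, here is a minimal, library-free sketch that computes them for the contrived arrays from the question. The function names `ecdf` and `ksStatistic` are made up for this illustration, and the scaling factor $\sqrt{nm/(n+m)}$ is the conventional two-sample one, assumed here rather than taken from jerzy's documentation.

```javascript
// Empirical CDF of `sample` evaluated at x: the fraction of values <= x.
function ecdf(sample, x) {
  let count = 0;
  for (const v of sample) {
    if (v <= x) count += 1;
  }
  return count / sample.length;
}

// Two-sample KS statistic: the largest gap between the two empirical CDFs,
// checked at every observed value, plus the scaled version of that gap.
function ksStatistic(a, b) {
  let d = 0;
  for (const x of a.concat(b)) {
    d = Math.max(d, Math.abs(ecdf(a, x) - ecdf(b, x)));
  }
  const n = a.length;
  const m = b.length;
  // Assumed conventional scaling factor sqrt(n * m / (n + m)).
  const ks = d * Math.sqrt((n * m) / (n + m));
  return { d, ks };
}

const stat = ksStatistic([750, 740, 790], [750, 610, 960]);
console.log(stat); // d is 1/3 here, ks is about 0.408
```

With only three points per sample, $d$ can take very few distinct values, which is one way to see why such small samples rarely lead to rejection.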

Lastly, the Kolmogorov–Smirnov test does not provide the confidence intervals you are asking for. Some other test might be better suited if you need them.

Related Solutions

Kolmogorov-Smirnov Test – How to Perform a Kolmogorov-Smirnov Two-Sample Test

I am assuming you are asking because the Suanshu help page reports, in reference to the K-S distribution, "This is not done yet." Luckily, it is very easy to do in R. If x and y are your two samples, ks.test(x,y) returns the test statistic and p-value. For example,

> x <- rnorm(50)
> y <- runif(30)
> ks.test(x, y)    
        Two-sample Kolmogorov-Smirnov test    
data:  x and y 
D = 0.5, p-value = 9.065e-05
alternative hypothesis: two-sided

By default, it will compute exact or asymptotic p-values based on the product of the sample sizes (exact p-values for n.x*n.y < 10000 in the two-sample case), or you can specify this option with a third argument, exact=F or exact=T. Exact p-values are calculated using the methods of Marsaglia, et al. (2003), which the Suanshu documentation also cites. Some large sample approximations are given here, although I don't have a proper citation. Lastly, if you don't want to install R, there are web calculators for the two-sample K-S test, although I don't know if they use the same algorithm as R because the one I found only reported three decimal places for the p-value.

Solved – Kolmogorov-Smirnov two-sample $p$-values

Under the null hypothesis, the asymptotic distribution of the two-sample Kolmogorov–Smirnov statistic is the Kolmogorov distribution, which has CDF

$$\operatorname{Pr}(K\leq x)=\frac{\sqrt{2\pi}}{x}\sum_{i=1}^\infty e^{-(2i-1)^2\pi^2/(8x^2)} \>.$$

The $p$-values can be calculated from this CDF - see Section 4 and Section 2 of the Wikipedia page on the Kolmogorov–Smirnov test.
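As a rough sketch of that calculation, the series above can be summed directly and the asymptotic $p$-value taken as $1 - \operatorname{Pr}(K \leq x)$, where $x$ is the scaled statistic. The names `kolmogorovCdf` and `ksPValue` and the fixed number of terms are choices made for this example, not any library's API, and the result is only the large-sample approximation.

```javascript
// Kolmogorov distribution CDF, summing the series
// sqrt(2*pi)/x * sum_{i>=1} exp(-(2i-1)^2 * pi^2 / (8 x^2)).
// `terms` is an arbitrary truncation point; the series converges quickly
// for moderate x.
function kolmogorovCdf(x, terms = 100) {
  if (x <= 0) return 0;
  let sum = 0;
  for (let i = 1; i <= terms; i++) {
    const k = 2 * i - 1;
    sum += Math.exp(-(k * k) * Math.PI * Math.PI / (8 * x * x));
  }
  return (Math.sqrt(2 * Math.PI) / x) * sum;
}

// Asymptotic p-value for a scaled KS statistic.
function ksPValue(ks) {
  return 1 - kolmogorovCdf(ks);
}

// The ks value quoted in the question at the top of this page.
console.log(ksPValue(0.6660700343005954)); // approximately 0.767
```

Applied to the `ks` value from the jerzy output in the question, this gives a value close to the reported `p`, which suggests that `p` there is this asymptotic approximation.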

You seem to be saying that a non-parametric test statistic shouldn't have a distribution. That's not the case: what makes this test non-parametric is that the distribution of the test statistic does not depend on which continuous probability distribution the original data come from. Note that the KS test has this property even for finite samples, as shown by @cardinal in the comments.
