Solved – Applying the Kolmogorov-Smirnov test in R

kolmogorov-smirnov test, r

I tried to use the Kolmogorov-Smirnov test to check whether a sample is exponentially distributed. I tried a couple of rates by trial and error. This is a small, simple example of what I do:

    ks.test(Interarrivaltimes,pexp,0.00029)

Here is the result R gives me:

One-sample Kolmogorov-Smirnov test

data:  Interarrivaltimes
D = 0.023961, p-value < 2.2e-16
alternative hypothesis: two-sided

The p-value is very low, whereas I expected the test not to reject the null hypothesis.

I do not understand why it does not work.

Best Answer

It's hard to give a specific answer without the details requested earlier, but I think I can point you in the right general direction.

First, let's consider a sample of n = 15 from an exponential distribution with a rate of 0.00029. When we run the ks.test, we fail to reject the null hypothesis, as expected.

    set.seed(pi)  # fix the RNG seed so the draws are reproducible

    x <- rexp(15, 0.00029)
    ks.test(x, pexp, 0.00029)

Now let's consider a case where n = 1,000, and the rate is still 0.00029. In this particular instance, we get a p-value of 0.9784. Again, we fail to reject the null hypothesis.

    x2 <- rexp(1000, 0.00029)
    ks.test(x2, pexp, 0.00029)

Now let's look at something we're more likely to see in practice. When we take a sample, we usually have to estimate the parameters of distributions. So if your inter-arrival times come from a sample and you've estimated that the rate is 0.00029, that is only an estimate and doesn't tell us what the true population rate is. Why is this important?

At a small sample size, you probably won't detect much of a difference between your estimated distribution and your population distribution. Let's assume that the population rate is actually 0.00030, but you've gotten a very, very close estimate of 0.00029. A difference of one hundred-thousandth (0.00001) doesn't seem like much, does it? In a sample of size 15, we still fail to reject the null hypothesis (p = 0.8255).

    y <- rexp(15, 0.00030)      # true rate is 0.00030
    ks.test(y, pexp, 0.00029)   # but we test against 0.00029

Now let's take a large sample of n = 1,000. In this example, even with such a small difference between the population rate and the estimated rate, we get a p-value of 0.07506, which is very close to the common 0.05 significance threshold.

    y2 <- rexp(1000, 0.00030)
    ks.test(y2, pexp, 0.00029)

In yet another sample of n = 1,000, we can get a p-value of 0.008196, which leads us to reject the null hypothesis at most conventional significance levels.

The moral of the story is that very small differences from the population parameter can be detected as "significantly different" given a large enough sample size.
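To see this effect across many samples rather than one or two, here is a minimal simulation sketch. The helper name `rejection_rate`, the number of replications, and the largest sample size of 50,000 are my own illustrative choices, not part of the examples above. It draws repeated samples from an exponential distribution with rate 0.00030, tests each against rate 0.00029, and reports the share of tests that reject at the 5% level.

```r
set.seed(1)

# Share of KS tests that reject H0 (rate = tested_rate) when the data
# actually come from an exponential distribution with rate = true_rate.
rejection_rate <- function(n, true_rate = 0.00030, tested_rate = 0.00029,
                           reps = 200, alpha = 0.05) {
  p_values <- replicate(
    reps,
    ks.test(rexp(n, true_rate), pexp, tested_rate)$p.value
  )
  mean(p_values < alpha)
}

sample_sizes <- c(15, 1000, 50000)
rates <- sapply(sample_sizes, rejection_rate)
names(rates) <- sample_sizes
print(rates)
```

The rejection share should stay near the nominal level for n = 15 and climb toward 1 as n grows, even though the gap between the two rates never changes.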

So rejecting the null hypothesis in a large sample doesn't necessarily mean that your estimated parameter or distribution is a poor fit. It only means that the KS test detects a statistically significant difference. And as some of us are fond of saying, statistical significance is not the same thing as practical significance.
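One rough way to judge practical significance here is to look at how far apart the two candidate distributions actually are. The sketch below evaluates both exponential CDFs on a grid and reports the largest vertical gap between them, which is the quantity the KS statistic D estimates. The grid range and step are assumptions chosen to cover the bulk of both distributions.

```r
# Evaluate both CDFs on a grid covering the bulk of the distributions.
x_grid <- seq(0, 50000, by = 1)
cdf_gap <- abs(pexp(x_grid, rate = 0.00030) - pexp(x_grid, rate = 0.00029))

# Largest vertical distance between the two CDFs and where it occurs.
max_gap <- max(cdf_gap)
x_at_max <- x_grid[which.max(cdf_gap)]

c(max_gap = max_gap, x_at_max = x_at_max)
```

The largest gap is a little over 0.01 on the probability scale. Whether a discrepancy of that size matters depends on what the fitted distribution is used for, not on the p-value.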

Related Solutions

Kolmogorov-Smirnov Test – How to Perform a Kolmogorov-Smirnov Two-Sample Test

I am assuming you are asking because the Suanshu help page reports, in reference to the K-S distribution, "This is not done yet." Luckily, it is very easy to do in R. If x and y are your two samples, ks.test(x, y) returns the test statistic and p-value. For example,

> x <- rnorm(50)
> y <- runif(30)
> ks.test(x, y)    
        Two-sample Kolmogorov-Smirnov test    
data:  x and y 
D = 0.5, p-value = 9.065e-05
alternative hypothesis: two-sided

By default, it will compute exact or asymptotic p-values based on the product of the sample sizes (exact p-values for n.x*n.y < 10000 in the two-sample case), or you can specify this option with a third argument, exact = FALSE or exact = TRUE. Exact p-values are calculated using the methods of Marsaglia, et al. (2003), which the Suanshu documentation also cites. Some large sample approximations are given here, although I don't have a proper citation. Lastly, if you don't want to install R, there are web calculators for the two-sample K-S test, although I don't know if they use the same algorithm as R because the one I found only reported three decimal places for the p-value.
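As a small illustration of the `exact` argument, the sketch below runs the same two-sample test twice, once forcing the exact p-value and once forcing the asymptotic one. The seed and the variable names `exact_result` and `asymptotic_result` are only illustrative.

```r
set.seed(42)
x <- rnorm(50)
y <- runif(30)

# Same data and same statistic D; only the p-value calculation differs.
exact_result <- ks.test(x, y, exact = TRUE)
asymptotic_result <- ks.test(x, y, exact = FALSE)

c(D = unname(exact_result$statistic),
  exact = exact_result$p.value,
  asymptotic = asymptotic_result$p.value)
```

Spelling out `TRUE` and `FALSE` is a little safer than `T` and `F`, since the short forms are ordinary variables that can be reassigned.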

Solved – Using Kolmogorov–Smirnov test

1) The null hypothesis is that the data is distributed according to the theoretical distribution.

2) Let $N$ be your sample size, $D$ be the observed value of the Kolmogorov-Smirnov test statistic, and define $\lambda = D(0.12 + \sqrt{N} + 0.11 / \sqrt{N})$. Then the p-value for the test statistic is approximately:

$Q = 2 \sum_{j=1}^{\infty}(-1)^{j-1}\exp\{-2j^2\lambda^2\}$

Obviously you can't calculate the infinite sum, but if you sum over 100 values or so this will get you very, very, very close. This approximation is quite good even for small values of $N$, as low as 5 if I recall correctly, and gets better as $N$ increases. Note, however, that @whuber in comments proposes a better approach.
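The formula above can be sketched in R as follows. The function names `ks_q` and `ks_pvalue_approx` are made up for illustration, and the comparison data are simulated. Truncating at 100 terms is fine for typical values of $\lambda$ but becomes unreliable when $\lambda$ is very close to zero, so the result is clamped to [0, 1] as a precaution.

```r
# Truncated series Q(lambda) = 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 j^2 lambda^2)
ks_q <- function(lambda, terms = 100) {
  j <- seq_len(terms)
  q <- 2 * sum((-1)^(j - 1) * exp(-2 * j^2 * lambda^2))
  min(max(q, 0), 1)  # guard against truncation error outside [0, 1]
}

# Approximate p-value from the observed statistic D and sample size N.
ks_pvalue_approx <- function(D, N, terms = 100) {
  lambda <- D * (sqrt(N) + 0.12 + 0.11 / sqrt(N))
  ks_q(lambda, terms)
}

# Compare with the asymptotic p-value from ks.test on simulated data.
set.seed(7)
z <- rnorm(200)
fit <- ks.test(z, pnorm, exact = FALSE)
c(approximation = ks_pvalue_approx(unname(fit$statistic), length(z)),
  ks.test = fit$p.value)
```

The two values should be close but not identical, because the formula here applies a small-sample correction to $\sqrt{N}$ before evaluating the series.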

This is a perfectly reasonable alternative to the Shapiro-Wilk test I suggested in answer to your other question, by the way. Shapiro-Wilk is more powerful, but if your sample size is in the high hundreds, the Kolmogorov-Smirnov test will have quite a bit of power too.
