Solved – Even more with the Kolmogorov-Smirnov test with R software

kolmogorov-smirnov test, multinomial-distribution, r

This follows on from the previous question on differences between K-S manual test and K-S test with R.

My frequency sample was

a=c(0,1,1,4,9)

Then the observed sample is

 obs=c(2,3,4,4,4,4,5,5,5,5,5,5,5,5,5)

The expected sample is then

exp=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)

I hope you agree.

First, I use ks.test, as before:

ks.test(obs,exp)

data:  obs and exp

D = 0.4667, p-value = 0.07626

Then, I use the ks.test the other way:

The expected distribution can be the uniform. Do you agree?

And then:

ks.test(obs, "punif", 0,5)

data:  obs 

D = 0.6667, p-value = 3.239e-06

Question

Why do the two approaches give different results?

Best Answer

The first is a two-sample test; the second is a one-sample test against a continuous distribution. Neither is used correctly:

The two-sample test views both sets of data as being data, but your "expected sample" is not data, it's a theoretical reference. It is not subject to any variation. The two-sample test thinks that it can vary. That's why the p-value is so large.
The reference distribution used in the one-sample test is a continuous uniform distribution between 0 and 5. However, these data look discrete: from the way they are given, it appears they can attain only the values 1, 2, ..., 5. Because the one-sample test doesn't know this, its p-value is probably too small.

At least this lets us infer that the correct p-value should lie somewhere between 0.076 and 3.2e-06. Because that doesn't settle the question, let's analyze further.
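One hedged way to address both objections at once is to simulate the null distribution of the KS statistic directly, keeping the reference fixed and the support discrete. The sketch below is not part of the original answer: it assumes a discrete uniform on 1..5, and the names `ks_stat`, `d_sim` and `p_mc`, the seed, and the number of replicates are all illustrative choices.

```r
# Monte Carlo null distribution of the KS statistic against a
# discrete uniform on 1..5 (an assumption, as in the answer's die model).
set.seed(1)                        # arbitrary seed, for reproducibility only
obs <- c(2, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5)
n <- length(obs)
support <- 1:5
ref_cdf <- support / 5             # fixed reference CDF at 1, ..., 5

# Both CDFs are step functions that jump only at 1..5, so the maximum
# gap can be found by comparing them at those five points.
ks_stat <- function(x) {
  emp_cdf <- cumsum(tabulate(x, nbins = 5)) / length(x)
  max(abs(emp_cdf - ref_cdf))
}

d_obs <- ks_stat(obs)              # 7/15, about 0.4667
d_sim <- replicate(20000, ks_stat(sample(support, n, replace = TRUE)))
p_mc <- mean(d_sim >= d_obs - 1e-12)   # small tolerance guards against float ties
p_mc
```

The simulated p-value varies with the seed and the number of replicates, so treat it as an estimate rather than an exact figure.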

To get a sense of whether the data (0, 1, 1, 4, 9) differ significantly from the discrete uniform frequencies (3, 3, 3, 3, 3), view the latter as describing a five-sided die. What are the chances that in 0+1+1+4+9 = 15 tosses of this die at least one value would appear 9 or more times? The events (1 appears 9 or more times), (2 appears 9 or more times), ..., (5 appears 9 or more times) are mutually exclusive--no two of them can hold at once, because that would require at least 18 tosses--so their probabilities add. Because the die is uniform, each of these five events has the same probability. We can compute the chance that a 5 comes up 9 or more times by viewing it like tosses of a biased coin: a 5 has a 1/5 chance; a non-5 has a 4/5 chance. The chance of 9 or more 5's therefore equals

$$\binom{15}{9}(1/5)^9(4/5)^6 + \binom{15}{10}(1/5)^{10}(4/5)^5 + \cdots + \binom{15}{15}(1/5)^{15}(4/5)^0.$$

This value is approximately 0.000785. Multiplying by 5 gives 0.00392 = 0.39%, still quite small. Thus this set of frequencies is unlikely to have arisen through a single experiment in which each of the values has an equal chance of arising.
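The binomial tail sum can be checked in base R. This is a minimal sketch; the variable names and the simulation size are illustrative, and the simulation is only a rough cross-check of the exact figure.

```r
# Exact: P(X >= 9) for X ~ Binomial(15, 1/5), i.e. one face shows 9+ times.
p_one <- pbinom(8, size = 15, prob = 1/5, lower.tail = FALSE)
p_one                               # about 0.000785

# The five events are mutually exclusive, so the probabilities add.
p_any <- 5 * p_one
p_any                               # about 0.00392

# Rough cross-check by simulating 15 tosses of a fair five-sided die.
set.seed(1)                         # arbitrary seed
counts <- rmultinom(100000, size = 15, prob = rep(1/5, 5))  # 5 x 100000 matrix
p_sim <- mean(colSums(counts >= 9) > 0)
p_sim
```

`pbinom(8, ..., lower.tail = FALSE)` gives P(X > 8), which is the same as P(X >= 9) for an integer-valued count.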

Related Solutions

Solved – Difference between K-S manual test and K-S test with R

You are testing a different thing.

While you think c(0,1,1,9,4) means you are looking at 0 values of one, 1 value of two, 1 value of three, 9 values of four, and 4 values of five, R thinks you are looking at one value of 0, two values of 1, one value of 9, and one value of 4.

To get D = 0.4667..., try the rather verbose

ks.test( c(2,3,4,4,4,4,4,4,4,4,4,5,5,5,5), 
         c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5) )

giving

    Two-sample Kolmogorov-Smirnov test 

D = 0.4667, p-value = 0.07626 
alternative hypothesis: two-sided
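The verbose call above can be written more compactly by expanding the frequency vector with `rep`. A minimal sketch in base R (the names `freq`, `obs` and `ref` are illustrative and do not come from the original posts):

```r
freq <- c(0, 1, 1, 9, 4)            # counts for the values 1, 2, 3, 4, 5
obs <- rep(1:5, times = freq)       # 2, 3, then nine 4s and four 5s
ref <- rep(1:5, each = 3)           # the "expected" sample: three of each value

# Ties may trigger a warning, so it is suppressed here.
res <- suppressWarnings(ks.test(obs, ref))
res$statistic                       # D = 7/15, about 0.4667
```

The reported p-value may depend on how your R version handles ties, so only the statistic is annotated here.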

Kolmogorov-Smirnov Test – Proper Use with dgof::ks.test in R for Discrete Data

This is an answer to @jbrucks's extension (but it answers the original question as well).

One general test of whether two samples come from the same population or distribution is the permutation test. Choose a statistic of interest: the KS test statistic, the difference of means, the difference of medians, the ratio of variances, or whatever is most meaningful for your question (you could run simulations under likely conditions to see which statistic gives the best results). Compute that statistic on the original two samples. Then randomly permute the observations between the groups (pool all the data points, then randomly split the pool into two groups of the same sizes as the original samples) and compute the statistic on the permuted samples. Repeat this many times. The distribution of the permuted statistics forms your null distribution, and you compare the original statistic to it to form the test. Note that the null hypothesis is that the distributions are identical, not just that the means, medians, etc. are equal.
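The permutation procedure can be sketched in base R as follows. The two samples `x` and `y` are hypothetical data invented for illustration, the helper names `ks_d` and `perm_test` are not from any package, and the KS distance is just one possible choice of statistic.

```r
set.seed(42)                        # arbitrary seed
x <- c(2, 3, 4, 4, 5, 5, 5, 5)      # hypothetical sample 1
y <- c(1, 1, 2, 2, 3, 3, 4, 5)      # hypothetical sample 2

# Two-sample KS distance: largest gap between the two empirical CDFs.
ks_d <- function(a, b) {
  grid <- sort(unique(c(a, b)))
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

perm_test <- function(x, y, stat, n_perm = 5000) {
  pooled <- c(x, y)
  observed <- stat(x, y)
  perm <- replicate(n_perm, {
    idx <- sample(length(pooled), length(x))   # random split, same group sizes
    stat(pooled[idx], pooled[-idx])
  })
  # The +1 counts the observed arrangement as one of the permutations.
  p <- (sum(perm >= observed - 1e-12) + 1) / (n_perm + 1)
  list(statistic = observed, p.value = p)
}

res <- perm_test(x, y, ks_d)
res$statistic                       # 0.5 for these hypothetical samples
res$p.value
```

Any other two-argument statistic, such as `function(a, b) abs(mean(a) - mean(b))`, can be passed as `stat` instead. Both samples must be actual data: this does not apply to a theoretical "expected sample".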

If you don't want to assume that the distributions are identical but want to test for a difference in means/medians/etc. then you could do a bootstrap.

If you know what distribution the data come from (or are at least willing to assume one), then you can do a likelihood ratio test on the equality of the parameters: compare the model with a single set of parameters for both groups to the model with separate sets of parameters. The likelihood ratio test usually uses a chi-squared reference distribution, which is fine in many cases (asymptotics). But if the sample sizes are small or you are testing a parameter near its boundary (a variance being 0, for example), the approximation may not be good, and you could again use the permutation test to get a better null distribution.

These tests all work on either continuous or discrete distributions. You should also include some measure of power or a confidence interval to indicate the amount of uncertainty: a lack of significance could be due to low power, and a statistically significant difference could still be practically meaningless.
