Solved – Kolmogorov-Smirnov with discrete data: What is the proper use of dgof::ks.test in R?

Tags: discrete-data, goodness-of-fit, kolmogorov-smirnov-test, r

Beginner questions:

I want to test whether two discrete data sets come from the same distribution. A Kolmogorov-Smirnov test was suggested to me.

Conover (Practical Nonparametric Statistics, 3rd ed.) seems to say that the Kolmogorov-Smirnov test can be used for this purpose, but that its behavior is "conservative" with discrete distributions, and I'm not sure what that means here.

DavidR's comment on another question says "… You can still make a level α test based on the K-S statistic, but you'll have to find some other method to get the critical value, e.g. by simulation."

The version of ks.test() in the dgof R package (article, cran) adds some capabilities not present in the default version of ks.test() in the stats package. Among other things, dgof::ks.test includes this parameter:

simulate.p.value: a logical indicating whether to compute p-values by
Monte Carlo simulation, for discrete goodness-of-fit tests only.

Is the purpose of simulate.p.value = TRUE to accomplish what DavidR suggests?

Even if it is, I'm not sure whether I can really use dgof::ks.test for a two-sample test. It looks like it only provides a two-sample test for a continuous distribution:

If y is numeric, a two-sample test of the null hypothesis that x and y
were drawn from the same continuous distribution is performed.

Alternatively, y can be a character string naming a continuous
(cumulative) distribution function (or such a function), or an ecdf
function (or object of class stepfun) giving a discrete distribution.
In these cases, a one-sample test is carried out of the null that the
distribution function which generated x is distribution y ….
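For what it's worth, here is a minimal sketch of how the one-sample discrete usage looks, assuming the dgof package is installed; the data here are made-up stand-ins for the clustered simulation means described below, and ecdf(y) plays the role of the hypothesized discrete distribution:

```r
# Assumes install.packages("dgof") has been run
library(dgof)
set.seed(1)

# Hypothetical discrete samples: values clustered near -0.9 and 0.9
x <- sample(c(-0.9, 0.9), 40, replace = TRUE, prob = c(0.4, 0.6))
y <- sample(c(-0.9, 0.9), 400, replace = TRUE, prob = c(0.4, 0.6))

# One-sample test of x against the discrete distribution given by ecdf(y),
# with the p-value computed by Monte Carlo simulation (B replicates)
res <- dgof::ks.test(x, ecdf(y), simulate.p.value = TRUE, B = 2000)
res
```

Note that this is still a one-sample test against a fixed discrete distribution, not a two-sample test, which is exactly the limitation the quoted documentation describes.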

(Background details: Strictly speaking, my underlying distributions are continuous, but the data tend to lie very near a handful of points. Each point is the result of a simulation, and is a mean of 10 or 20 real numbers between -1 and 1. By the end of the simulation, those numbers are nearly always very close to 0.9 or -0.9. Thus the means cluster around a few values, and I am treating them as discrete. The simulation is complex, and I have no reason to think that the data follow a well-known distribution.)

Advice?

Best Answer

This is an answer to @jbrucks' extension (but it answers the original as well).

One general test of whether two samples come from the same population/distribution, or whether there is a difference, is the permutation test. Choose a statistic of interest: this could be the KS test statistic, the difference of means, the difference of medians, the ratio of variances, or whatever is most meaningful for your question (you could run simulations under likely conditions to see which statistic gives you the best results). Compute that statistic on the original two samples. Then randomly permute the observations between the groups: pool all the data points together, randomly split them into two groups of the same sizes as the original samples, and compute the statistic of interest on the permuted samples. Repeat this many times; the distribution of those statistics forms your null distribution, and you compare the original statistic to this distribution to form the test. Note that the null hypothesis is that the distributions are identical, not just that the means/medians/etc. are equal.
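The recipe above can be sketched in base R; the two samples here are made-up stand-ins for the clustered means described in the question, and the KS statistic is used as the statistic of interest (ties are not a problem because the null distribution is built by permutation rather than taken from the usual KS tables):

```r
set.seed(1)
# Hypothetical discrete samples clustered near -0.9 and 0.9
x <- sample(c(-0.9, 0.9), 30, replace = TRUE, prob = c(0.4, 0.6))
y <- sample(c(-0.9, 0.9), 25, replace = TRUE, prob = c(0.6, 0.4))

# Statistic of interest: the two-sample KS statistic
# (suppressWarnings() silences the ties warning; the permutation null
# makes it irrelevant)
ks_stat <- function(a, b) suppressWarnings(ks.test(a, b)$statistic)

obs <- ks_stat(x, y)

# Null distribution: pool the data, randomly re-split into groups of the
# original sizes, recompute the statistic
pooled <- c(x, y)
n <- length(x)
perm <- replicate(2000, {
  idx <- sample(length(pooled), n)
  ks_stat(pooled[idx], pooled[-idx])
})

# Permutation p-value (the +1 terms keep the estimate away from exactly 0)
p_value <- (sum(perm >= obs) + 1) / (length(perm) + 1)
p_value
```

Any other statistic (difference of means, ratio of variances, ...) can be substituted for ks_stat without changing the rest of the procedure.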

If you don't want to assume that the distributions are identical, but want to test for a difference in means/medians/etc., then you could do a bootstrap.
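As a sketch of that bootstrap idea, here is a percentile confidence interval for a difference in means, resampling each group separately (the samples are hypothetical placeholders):

```r
set.seed(1)
# Hypothetical samples (stand-ins for the real data)
x <- rnorm(30, mean = 0.2)
y <- rnorm(25, mean = 0.0)

obs_diff <- mean(x) - mean(y)

# Resample within each group, with replacement, keeping the group sizes,
# and recompute the difference in means each time
boot_diff <- replicate(5000,
  mean(sample(x, replace = TRUE)) - mean(sample(y, replace = TRUE)))

# Percentile bootstrap 95% confidence interval for the mean difference
ci <- quantile(boot_diff, c(0.025, 0.975))
ci
```

If the interval excludes 0, that is evidence of a difference in means without any assumption that the two distributions are otherwise identical.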

If you know what distribution the data come from (or are at least willing to assume one), then you can do a likelihood ratio test on the equality of the parameters: compare the model with a single set of parameters over both groups to the model with separate sets of parameters for each group. The likelihood ratio test usually uses a chi-squared reference distribution, which is fine in many cases (it is an asymptotic result), but if you are using small sample sizes or testing a parameter near its boundary (a variance being 0, for example) then the approximation may not be good; you could again use the permutation test to get a better null distribution.
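As one concrete instance of that comparison, assuming normal data with a common variance, the pooled-versus-separate-means models can be fit with lm() and compared via the chi-squared approximation (the data here are simulated placeholders):

```r
set.seed(1)
# Hypothetical samples under an assumed normal model
x <- rnorm(30, mean = 0.0)
y <- rnorm(25, mean = 0.5)

v <- c(x, y)
g <- factor(rep(c("x", "y"), c(length(x), length(y))))

m0 <- lm(v ~ 1)   # one set of parameters: a single common mean
m1 <- lm(v ~ g)   # separate means for the two groups

# Likelihood ratio statistic and chi-squared p-value (1 extra parameter)
lr <- as.numeric(2 * (logLik(m1) - logLik(m0)))
p  <- pchisq(lr, df = 1, lower.tail = FALSE)
p
```

The same pattern (fit restricted and unrestricted models, compare log-likelihoods) carries over to other assumed distributions; for small samples, the chi-squared reference can be replaced by a permutation null as noted above.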

These tests all work on either continuous or discrete distributions. You should also include some measure of power or a confidence interval to indicate the amount of uncertainty: a lack of significance could be due to low power, and a statistically significant difference could still be practically meaningless.