This is an answer to @jbrucks' extension (but it answers the original question as well).
One general test of whether two samples come from the same population/distribution, or whether there is a difference, is the permutation test. Choose a statistic of interest (this could be the KS test statistic, the difference of means, the difference of medians, the ratio of variances, or whatever is most meaningful for your question; you could run simulations under likely conditions to see which statistic gives you the best results) and compute that statistic on the original two samples. Then randomly permute the observations between the groups: pool all the data points together, randomly split them into two groups of the same sizes as the original samples, and compute the statistic of interest on the permuted samples. Repeat this many times; the distribution of those statistics forms your null distribution, and you compare the original statistic to this distribution to carry out the test. Note that the null hypothesis is that the distributions are identical, not just that the means/medians/etc. are equal.
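A minimal R sketch of that recipe, using the difference in means as the statistic; x and y here are just made-up placeholder samples:

set.seed(1)
x <- rnorm(30)                 # placeholder sample 1
y <- rnorm(25, mean = 0.5)     # placeholder sample 2

stat <- function(a, b) mean(a) - mean(b)       # statistic of interest (swap in your own)
obs  <- stat(x, y)

pooled <- c(x, y)
perm <- replicate(10000, {
  idx <- sample(length(pooled), length(x))     # random split into groups of the original sizes
  stat(pooled[idx], pooled[-idx])
})

mean(abs(perm) >= abs(obs))    # two-sided permutation p-value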
If you don't want to assume that the distributions are identical but do want to test for a difference in means/medians/etc., then you could do a bootstrap instead.
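A similarly rough sketch of a percentile bootstrap for a difference in medians, resampling each group separately so the two distributions are not assumed identical (x and y are placeholders again):

set.seed(1)
x <- rnorm(30)
y <- rnorm(25, mean = 0.5)

boot.diff <- replicate(10000, {
  median(sample(x, replace = TRUE)) - median(sample(y, replace = TRUE))
})
quantile(boot.diff, c(0.025, 0.975))   # 95% percentile bootstrap interval for the difference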
If you know what distribution the data come from (or are at least willing to assume a distribution), then you can do a likelihood ratio test on the equality of the parameters (compare the model with a single set of parameters over both groups to the model with separate sets of parameters for each group). The likelihood ratio test usually uses a chi-squared reference distribution, which is fine in many cases (it relies on asymptotics), but if you have small sample sizes or are testing a parameter near its boundary (a variance being 0, for example) then the approximation may not be good; you could again use the permutation test to get a better null distribution.
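For example, assuming normal data, a hand-rolled likelihood ratio test of one common (mean, sd) for both groups versus a separate (mean, sd) per group might look like this, reusing the placeholder x and y from the sketches above:

loglik.norm <- function(z) {
  n <- length(z)
  s <- sqrt(sum((z - mean(z))^2) / n)            # MLE of the standard deviation
  sum(dnorm(z, mean = mean(z), sd = s, log = TRUE))
}

ll0 <- loglik.norm(c(x, y))              # null: one set of parameters for the pooled data
ll1 <- loglik.norm(x) + loglik.norm(y)   # alternative: separate parameters per group
lrt <- 2 * (ll1 - ll0)
pchisq(lrt, df = 2, lower.tail = FALSE)  # 2 extra parameters under the alternative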
These tests all work on either continuous or discrete distributions. You should also include some measure of power or a confidence interval to indicate the amount of uncertainty: a lack of significance could be due to low power, and a statistically significant difference could still be practically meaningless.
The issue with the Kolmogorov-Smirnov test and distributions that aren't continuous is that the usual null distribution of the test statistic is derived assuming ties are impossible, so with discrete data that null distribution doesn't apply.
Indeed it's no longer distribution-free, and using the test "as is" is generally quite conservative (has a substantially lower type I error rate than the nominal rate - and correspondingly lower power).
One possibility is to use the statistic but actually compute the permutation distribution (in small samples) or sample from it (a randomization test).
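For example, sticking with the KS statistic but getting its null distribution by randomly re-splitting the pooled data (a sketch, using the count data from the table further down):

x <- rep(0:5, times = c(4, 7, 9, 3, 1, 1))   # the X counts from the table below
y <- rep(0:5, times = c(0, 2, 5, 6, 12, 5))  # the Y counts

ks.stat <- function(a, b) suppressWarnings(ks.test(a, b)$statistic)  # suppress the ties warning
obs <- ks.stat(x, y)

pooled <- c(x, y)
perm <- replicate(10000, {
  idx <- sample(length(pooled), length(x))
  ks.stat(pooled[idx], pooled[-idx])
})
mean(perm >= obs)   # randomization p-value for the KS statistic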
The chi-square test tends to have low power against interesting alternatives because it ignores ordering. Smooth tests of goodness of fit (which in the simplest case can be treated as a partitioning of the chi-square into low-order components and an untested residual) don't ignore the ordering and tend to have better power. See, for example, the books by Rayner and Best (and others, in some cases).
To get the chi-square to work (though with ordered data I wouldn't do it this way, as I mentioned) you'll need to present it as a two-row (or -column) table of counts:
value: 0 1 2 3 4 5
X: 4 7 9 3 1 1
Y: 0 2 5 6 12 5
What you are doing is a test of homogeneity of proportions. For the chi-square, which conditions on both margins, this is identical to a test of independence.
So for this data frame, which I have called xycnt:
x y
0 4 0
1 7 2
2 9 5
3 3 6
4 1 12
5 1 5
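In case you want to reproduce it, one way to build that data frame (the observed values 0 to 5 become the row names) is:

xycnt <- data.frame(x = c(4, 7, 9, 3, 1, 1),
                    y = c(0, 2, 5, 6, 12, 5),
                    row.names = 0:5)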
we just do this:
> chisq.test(xycnt)
Pearson's Chi-squared test
data: xycnt
X-squared = 20.6108, df = 5, p-value = 0.0009593
Warning message:
In chisq.test(xycnt) : Chi-squared approximation may be incorrect
In this case it complains because the expected counts in some cells are small. One solution is not to rely on the chi-square approximation to the distribution of the test statistic but to simulate that distribution, obtaining a simulated p-value:
> chisq.test(xycnt, simulate.p.value = TRUE, B = 100000)
Pearson's Chi-squared test with simulated p-value (based on 1e+05 replicates)
data: xycnt
X-squared = 20.6108, df = NA, p-value = 0.00032
With such a small p-value, simulated estimates of it are a bit variable, but always small. You can always increase the number of simulations further; it's pretty fast. (Ten million simulations generally give p-values between 0.00032 and 0.00033 and take only a few seconds.)
Best Answer
It looks like you have a clear understanding of all the available tests. What I would suggest is that you get the book "Goodness-of-Fit Techniques" by Ralph B. D'Agostino. It provides background on the tests as well as good examples. Here is a link to a PDF: http://www.gbv.de/dms/ilmenau/toc/04207259X.PDF.
One thing you could add to the list of methods for comparing probability densities is the Kullback-Leibler divergence. If you are a scientist you will see the relation between this method and entropy, but that's not required. Here is the introduction (explanation) of the approach, which I took from Wikipedia: "In probability theory and information theory, the Kullback-Leibler divergence (also information divergence, information gain, relative entropy, KLIC, or KL divergence) is a measure of the difference between two probability distributions P and Q..." It is also available in MATLAB. (A short R sketch of this, and of the Jensen-Shannon divergence mentioned next, follows below.)
An "off-shoot" of Kullback-Leibler is the Jensen–Shannon divergence for probability distributions (this is a more common approach to comparing probability distrubtions (PD). This looks at the similarity of the PDs. Should be numerous references on this and is also covered in MatLab.
These methods aren't covered in the reference I gave you above.
I also found a good paper for you to look at: "On Choosing and Bounding Probability Metrics", https://www.math.hmc.edu/~su/papers.dir/metrics.pdf. The metrics covered are: discrepancy, Hellinger distance, Kullback-Leibler divergence, Kolmogorov metric, Lévy metric, Prokhorov metric, separation distance, total variation distance, Wasserstein (or Kantorovich) metric, and chi-squared metric.
I should also mention the "Q-Q plot" (where Q refers to quantile) as a simple way to compare two probability distributions (or to compare data to a probability distribution).
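For two samples this is just qqplot() in base R; for example, with the counts from the table above:

x <- rep(0:5, times = c(4, 7, 9, 3, 1, 1))
y <- rep(0:5, times = c(0, 2, 5, 6, 12, 5))
qqplot(x, y, main = "Two-sample Q-Q plot")
abline(0, 1)   # points near this line suggest similar distributions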
Also, one test that was left out earlier is the Anderson-Darling test. Here are two references for it: (1) http://www.win.tue.nl/~rmcastro/AppStat2013/files/lectures23.pdf (2) https://asaip.psu.edu/Articles/beware-the-kolmogorov-smirnov-test. The second reference goes over the problems you can encounter with the Kolmogorov-Smirnov test.
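A sketch of the two-sample version, assuming the kSamples package (its ad.test() implements the k-sample Anderson-Darling test), again on the counts from the table above:

# install.packages("kSamples")   # if not already installed
library(kSamples)
x <- rep(0:5, times = c(4, 7, 9, 3, 1, 1))
y <- rep(0:5, times = c(0, 2, 5, 6, 12, 5))
ad.test(x, y)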