Solved – Comparing two (or more) discrete distributions

discrete-data, distributions, goodness-of-fit, kolmogorov-smirnov-test

I would like to know what the most powerful way of comparing two (or more) discrete distributions is.

I know that the Kolmogorov-Smirnov test could be used (if corrected for the discrete ECDFs), and/or a chi-squared test, and that summary statistics (mean, variance, skewness, etc.) could be compared, but is there a more powerful test along the lines of the Cramér–von Mises test? There is unlikely to be much deviation across the whole of the discrete distributions, so I'd like the test to have as much power as possible; that is the situation Cramér–von Mises would be best suited for if the samples came from a continuous distribution.

Some background:

Multiple machines generate strings of a fixed length with an added 'tail' (a random number – say between 0 and 250 – of special characters). Changing the environment the machines sit in may (or may not) change the tail-length distributions.

Taking all the strings from the machines at different time points will give a time-varying distribution of special character tail lengths.

I'd like to know if we can test whether there are significant changes in the tail-length distribution over the time course.
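To make the setup concrete, here is a minimal sketch of the kind of comparison I have in mind for two time points, using the chi-squared test mentioned above. The tail-length samples and the bin width are made-up placeholders, not real data:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
tails_t1 = rng.integers(0, 251, size=1000)   # tail lengths at time point 1 (made up)
tails_t2 = rng.integers(0, 251, size=1000)   # tail lengths at time point 2 (made up)

# Bin the discrete tail lengths so the expected count per cell isn't too small.
bins = np.arange(0, 260, 10)
counts_t1, _ = np.histogram(tails_t1, bins=bins)
counts_t2, _ = np.histogram(tails_t2, bins=bins)

# Chi-squared test of homogeneity on the 2 x k contingency table of counts.
stat, p_value, dof, expected = chi2_contingency(np.vstack([counts_t1, counts_t2]))
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p_value:.3f}")
```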

Best Answer

It looks like you have a clear understanding of all the available tests. I would suggest getting the book "Goodness-of-Fit Techniques" by Ralph B. D'Agostino. It provides background on the tests as well as good examples. Here is a link to a PDF: http://www.gbv.de/dms/ilmenau/toc/04207259X.PDF.

One thing you could add to the list of methods for comparing probability distributions is the Kullback–Leibler divergence. If you are a scientist, you will see the relation between this method and entropy, but that's not required. Here is the introduction to the approach, taken from Wikipedia: "In probability theory and information theory, the Kullback–Leibler divergence (also information divergence, information gain, relative entropy, KLIC, or KL divergence) is a measure of the difference between two probability distributions P and Q..." It is also available in MATLAB.
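As a rough sketch in Python (rather than MATLAB): the sample data and the pseudo-count smoothing below are my own placeholder assumptions, not part of any standard recipe.

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(1)
sample_p = rng.binomial(250, 0.40, size=2000)   # tail lengths under condition P (made up)
sample_q = rng.binomial(250, 0.45, size=2000)   # tail lengths under condition Q (made up)

# Empirical PMFs over the support 0..250, with a small pseudo-count so that
# D_KL stays finite where one sample has zero counts.
p = np.bincount(sample_p, minlength=251) + 0.5
q = np.bincount(sample_q, minlength=251) + 0.5
p, q = p / p.sum(), q / q.sum()

# scipy.stats.entropy(p, q) computes sum(p * log(p / q)) = D_KL(P || Q) in nats.
kl_pq = entropy(p, q)
print(f"D_KL(P || Q) = {kl_pq:.4f} nats")
```

Note that KL divergence is not symmetric: D_KL(P || Q) and D_KL(Q || P) generally differ.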

An "off-shoot" of Kullback-Leibler is the Jensen–Shannon divergence for probability distributions (this is a more common approach to comparing probability distrubtions (PD). This looks at the similarity of the PDs. Should be numerous references on this and is also covered in MatLab.

These methods aren't covered in the reference I gave you above.

I also found a good paper for you to look at: "On Choosing and Bounding Probability Metrics", https://www.math.hmc.edu/~su/papers.dir/metrics.pdf. The metrics covered are: discrepancy, Hellinger distance, Kullback–Leibler divergence, Kolmogorov metric, Lévy metric, Prokhorov metric, separation distance, total variation distance, Wasserstein (or Kantorovich) metric, and the chi-squared metric.
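For instance, the total variation distance from that list is trivial to compute for two discrete PMFs on a common support; the PMFs below are made up for illustration.

```python
import numpy as np

p = np.array([0.10, 0.20, 0.30, 0.40])   # hypothetical PMF P
q = np.array([0.15, 0.25, 0.25, 0.35])   # hypothetical PMF Q

# For discrete distributions, TV(P, Q) = (1/2) * sum_i |p_i - q_i|.
tv = 0.5 * np.abs(p - q).sum()
print(f"total variation distance = {tv:.3f}")
```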

I should also mention the q-q plot (where q refers to quantile) as a simple way to compare two probability distributions (or to compare data to a probability distribution).
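A minimal two-sample q-q plot just plots matching empirical quantiles of one sample against the other; the data below are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
sample_a = rng.binomial(250, 0.40, size=1000)   # made-up tail lengths, condition A
sample_b = rng.binomial(250, 0.45, size=1000)   # made-up tail lengths, condition B

probs = np.linspace(0.01, 0.99, 99)
qa = np.quantile(sample_a, probs)
qb = np.quantile(sample_b, probs)

plt.scatter(qa, qb, s=10)
plt.plot([qa.min(), qa.max()], [qa.min(), qa.max()], "r--")  # y = x reference line
plt.xlabel("sample A quantiles")
plt.ylabel("sample B quantiles")
plt.title("Two-sample q-q plot")
plt.show()
```

Points falling close to the y = x line suggest the two distributions are similar; systematic departures show where they differ.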

Also, one test that was left out earlier is the Anderson–Darling test. Here are two references for it: (1) http://www.win.tue.nl/~rmcastro/AppStat2013/files/lectures23.pdf and (2) https://asaip.psu.edu/Articles/beware-the-kolmogorov-smirnov-test. The second reference goes over the problems you can encounter with the Kolmogorov–Smirnov test.
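SciPy provides a k-sample Anderson–Darling test; its midrank variant (the default) is the one intended for tied or discretized data. A short sketch with placeholder samples:

```python
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(3)
tails_t1 = rng.binomial(250, 0.40, size=1000)   # made-up tail lengths, time point 1
tails_t2 = rng.binomial(250, 0.42, size=1000)   # made-up tail lengths, time point 2

res = anderson_ksamp([tails_t1, tails_t2])
print(f"A-D statistic = {res.statistic:.3f}")
# Note: SciPy caps the reported significance level to the range [0.001, 0.25].
print(f"approx. significance level = {res.significance_level:.3f}")
```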
