Solved – Test identicality of discrete distributions

chi-squared-test, discrete data, hypothesis testing, kolmogorov-smirnov test, r

Is there a standard way to test whether two vectors were drawn from the same discrete distribution in R? Something like a Kolmogorov-Smirnov test, but for discrete distributions.

I think a two-sample chi-squared test would be appropriate. Does any package provide it? I can't get chisq.test to work for me.

Best Answer

The issue with the Kolmogorov-Smirnov test and distributions that aren't continuous is that the possible permutations of the observations are not all equally likely, so the null distribution of the test statistic doesn't apply.

Indeed it's no longer distribution-free, and using the test "as is" is generally quite conservative: the actual type I error rate is substantially lower than the nominal rate, and power is correspondingly lower.

One possibility is to use the statistic but actually compute the permutation distribution (in small samples) or sample from it (a randomization test).
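A minimal sketch of the randomization version, assuming the two samples are held as raw vectors. The helper names ks_stat and perm_ks_test are made up for illustration (they are not from any package), and the vectors are simply the counts tabulated later in this answer expanded back into observations:

```r
# Two-sample KS statistic computed directly from the empirical CDFs,
# evaluated on the pooled set of distinct values (ties are fine here).
ks_stat <- function(x, y, grid) {
  max(abs(ecdf(x)(grid) - ecdf(y)(grid)))
}

# Randomization test: shuffle the group labels B times and compare the
# observed statistic with the shuffled ones. Hypothetical helper.
perm_ks_test <- function(x, y, B = 9999) {
  pooled <- c(x, y)
  grid <- sort(unique(pooled))
  n_x <- length(x)
  observed <- ks_stat(x, y, grid)
  perm <- replicate(B, {
    idx <- sample(length(pooled), n_x)
    ks_stat(pooled[idx], pooled[-idx], grid)
  })
  # Small tolerance guards against floating-point noise in the comparison;
  # the +1 terms count the observed arrangement itself.
  tol <- sqrt(.Machine$double.eps)
  p_value <- (sum(perm >= observed - tol) + 1) / (B + 1)
  list(statistic = observed, p.value = p_value)
}

# Example data: the X and Y counts from this answer, expanded to raw values
x <- rep(0:5, times = c(4, 7, 9, 3, 1, 1))
y <- rep(0:5, times = c(0, 2, 5, 6, 12, 5))

set.seed(1)
res <- perm_ks_test(x, y)
print(res)
```

For small samples the full permutation distribution could be enumerated instead of sampled; the sampling version above is just the cheaper option.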

The chi-square test tends to have low power against interesting alternatives because it ignores ordering. Smooth tests of goodness of fit (which in the simplest case can be treated as a partitioning of the chi-square into low-order components and an untested residual) don't ignore the ordering and tend to have better power. See, for example, the books by Rayner and Best (and others, in some cases).
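As a rough illustration of why using the ordering helps, here is a sketch that splits the chi-square into a 1-df linear trend component and a 4-df remainder. This is not the Rayner and Best smooth test itself; it swaps in the simpler chi-squared test for trend in proportions (prop.trend.test in base R's stats package), with equally spaced scores 0 to 5 as an assumption, applied to the counts used later in this answer:

```r
# Counts of each value 0..5 in the two samples (from the table in this answer)
x_cnt <- c(4, 7, 9, 3, 1, 1)
y_cnt <- c(0, 2, 5, 6, 12, 5)

# Linear trend component: does the share of X observations change
# steadily with the value? Scores 0:5 are an assumed, equally spaced coding.
trend <- prop.trend.test(x = x_cnt, n = x_cnt + y_cnt, score = 0:5)

# Overall Pearson chi-square on 5 df (warning about small expected
# counts suppressed here; it is discussed further down)
overall <- suppressWarnings(chisq.test(cbind(x_cnt, y_cnt)))

# What is left after removing the trend component (4 df)
remainder <- unname(overall$statistic - trend$statistic)

print(trend)
cat("overall chi-square:", unname(overall$statistic), "on 5 df\n")
cat("remainder after trend:", remainder, "on 4 df\n")
```

If most of the overall statistic sits in the 1-df trend piece, testing that piece alone spends far fewer degrees of freedom on the same evidence, which is where the power gain comes from.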


To get the chi-square to work (though with ordered data I wouldn't do it this way, as I mentioned) you'll need to present it as a two-row (or -column) table of counts:

value:   0  1  2  3  4  5 
    X:   4  7  9  3  1  1 
    Y:   0  2  5  6 12  5 

What you are doing is a test of homogeneity of proportions. For the chi-square, which conditions on both margins, this is identical to a test of independence.
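If you are starting from two raw vectors rather than counts, one way to get such a table is to pool the values and cross-tabulate them against a sample label. A minimal sketch, where the vectors x and y are reconstructed from the counts above and the names sample_id and counts are just illustrative:

```r
# Raw observations, reconstructed from the counts shown above
x <- rep(0:5, times = c(4, 7, 9, 3, 1, 1))
y <- rep(0:5, times = c(0, 2, 5, 6, 12, 5))

# Label each pooled observation with the sample it came from
sample_id <- rep(c("X", "Y"), times = c(length(x), length(y)))

# Cross-tabulating on the pooled values keeps cells with zero counts
# (e.g. Y has no 0s), so both rows cover the same set of values
counts <- table(sample = sample_id, value = c(x, y))
print(counts)

# Rows-by-columns or columns-by-rows makes no difference to the test
res_rows <- suppressWarnings(chisq.test(counts))
res_cols <- suppressWarnings(chisq.test(t(counts)))
print(res_rows$statistic)
```

Passing the two vectors directly, as in chisq.test(x, y), does something different: it treats them as paired factors of equal length, which is not the comparison wanted here.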

So for this data frame, which I have called xycnt:

  x  y
0 4  0
1 7  2
2 9  5
3 3  6
4 1 12
5 1  5

we just do this:

> chisq.test(xycnt)

    Pearson's Chi-squared test

data:  xycnt
X-squared = 20.6108, df = 5, p-value = 0.0009593

Warning message:
In chisq.test(xycnt) : Chi-squared approximation may be incorrect
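To see what triggers the warning, the expected counts can be pulled out of the fitted object. A small sketch, rebuilding xycnt from the table above (the threshold of 5 is the usual rule of thumb, used here as an assumption rather than a hard rule):

```r
# The data frame from above: rows are the values 0..5
xycnt <- data.frame(x = c(4, 7, 9, 3, 1, 1),
                    y = c(0, 2, 5, 6, 12, 5),
                    row.names = as.character(0:5))

# Fit once, silencing the warning, and inspect the expected counts
fit <- suppressWarnings(chisq.test(xycnt))
print(round(fit$expected, 2))

# How many cells fall below the conventional threshold of 5?
n_small <- sum(fit$expected < 5)
cat(n_small, "of", length(fit$expected), "expected counts are below 5\n")
```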

In this case it complains because the expected counts in some cells are small. One solution is not to rely on the chi-square approximation to the test statistic but to simulate its distribution, obtaining a simulated p-value:

> chisq.test(xycnt, simulate.p.value=TRUE, B=100000)

    Pearson's Chi-squared test with simulated p-value (based on 1e+05 replicates)

data:  xycnt
X-squared = 20.6108, df = NA, p-value = 0.00032

With such a small p-value, simulated estimates of it are a bit variable, but always small. You can always increase the number of simulations further; it's pretty fast. (Ten million simulations generally give p-values between 0.00032 and 0.00033 and take only a few seconds.)
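A quick way to see that run-to-run variability for yourself is to repeat the simulation a few times. A minimal sketch, with xycnt rebuilt from the table above and an arbitrary seed and repeat count chosen only for illustration:

```r
# The data frame from above
xycnt <- data.frame(x = c(4, 7, 9, 3, 1, 1),
                    y = c(0, 2, 5, 6, 12, 5),
                    row.names = as.character(0:5))

B <- 1e5
set.seed(42)  # arbitrary seed, only so the run is repeatable

# Repeat the simulated test a few times and collect the p-values
p_sim <- replicate(5, chisq.test(xycnt, simulate.p.value = TRUE, B = B)$p.value)
print(p_sim)
print(range(p_sim))

# Approximate Monte Carlo standard error of one simulated p-value,
# treating the count of exceedances as binomial
p_hat <- mean(p_sim)
mc_se <- sqrt(p_hat * (1 - p_hat) / B)
cat("approx. Monte Carlo SE at B =", B, ":", signif(mc_se, 2), "\n")
```

The standard error shrinks like 1/sqrt(B), so going from 1e5 to 1e7 replicates cuts the spread by a factor of ten.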