Solved – Testing randomly generated data against its intended distribution

distributions · hypothesis-testing · random-generation

I have written a program which generates random data. If the program is working correctly, that data should follow a specific, known probability distribution. I would like to run the program, do some calculations on the result, and come up with a p-value.

Before anybody else says it: I understand that hypothesis testing cannot detect when the program is operating correctly. It can only detect when it is operating incorrectly in a specific way. (And even then, the test "should" fail X% of the time, depending on what significance level you choose…)

So, I am trying to gain an understanding of what tools might be appropriate. In particular:

  • I can generate as much random data as I want. All I have to do is leave the program running long enough. So I'm not limited to any specific sample size.

  • I'm interested in techniques which produce a p-value. So staring at a graph and saying "yes, that looks kinda linear" is not an interesting option, unless there's some way of putting a hard number on the "wonkiness" of a graph. 😉

What I know so far:

  • I've seen three main sorts of test mentioned which sound like they might be applicable: the Pearson chi-squared test, the Kolmogorov-Smirnov (KS) test, and the Anderson-Darling (AD) test.

  • It appears that a chi-squared test is appropriate for discrete distributions, while the other two are more appropriate for continuous distributions. (?)

  • Various sources hint that the AD test is "better" than the KS test, but fail to go into any further detail.

Ultimately, all of these tests presumably detect "different ways" of deviating from the specified null distribution. But I don't really know what the differences are yet… In summary, I'm looking for some kind of general description of where each type of test is most applicable, and what sorts of problems it detects best.

Best Answer

Here is a general description of how the three methods mentioned work.

The chi-squared method works by comparing the number of observations in each bin to the number expected in that bin under the null distribution. For discrete distributions the bins are usually the individual outcomes, or combinations of them; for continuous distributions you choose cut points to create the bins. Many functions that implement this test create the bins automatically, but you should be able to define your own bins if you want to compare specific regions. The disadvantage of this method is that any difference between the theoretical distribution and the empirical data that still leaves values in the same bin will go undetected. Rounding is one example: if the numbers between 2 and 3 should theoretically be spread throughout that range (we expect to see values like 2.34296), but in practice all those values are rounded to 2 or 3 (we never even see a 2.5), and our bin covers the range from 2 to 3 inclusive, then the count in that bin will still match the theoretical prediction (this can be good or bad). If you want to detect this kind of rounding, you can choose the bins manually to capture it.
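As a minimal sketch of this in R (the fair-die data and the cut points below are my own invented examples, not part of the question), you compare binned counts to theoretical bin probabilities with chisq.test:

set.seed(1)

## Discrete case: counts per outcome vs. theoretical probabilities.
## 'rolls' is a stand-in for your program's output.
rolls <- sample(1:6, size = 6000, replace = TRUE)
observed <- table(factor(rolls, levels = 1:6))
chisq.test(observed, p = rep(1/6, 6))          # p-value vs. a fair die

## Continuous case: choose your own cut points to make bins.
x <- rnorm(1000)                               # stand-in for your program's output
breaks <- c(-Inf, -1, 0, 1, Inf)               # manual bins, e.g. to probe specific regions
observed <- table(cut(x, breaks))
expected <- diff(pnorm(breaks))                # theoretical bin probabilities
chisq.test(observed, p = expected)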

The KS test statistic is the maximum distance between the two cumulative distribution functions being compared (often a theoretical CDF and an empirical one). If the two probability density functions have only one intersection point, then one minus the maximum distance is the area of overlap between them (this helps some people visualize what is being measured). Think of plotting the theoretical CDF and the EDF on the same axes, then measuring the distance between the two "curves"; the largest difference is the test statistic, and it is compared against the distribution of that value when the null is true. This captures differences in the shape of the distribution, or one distribution being shifted or stretched relative to the other. It does not have much power against single outliers: if you take the maximum or minimum of the data and send it to infinity or negative infinity, the largest effect it can have on the test statistic is $\frac{1}{n}$. The test depends on your knowing the parameters of the reference distribution rather than estimating them from the data (your situation seems fine here). If you do estimate the parameters from the same data, you can still get a valid test by comparing to your own simulations rather than to the standard reference distribution.
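In R this is ks.test; a minimal sketch, assuming a fully specified standard-normal null (rnorm(1000) below is a stand-in for your program's output):

set.seed(1)
x <- rnorm(1000)                               # stand-in for your program's output
ks.test(x, "pnorm", mean = 0, sd = 1)          # parameters specified, NOT estimated from x

## If you had to estimate the parameters from the same data, calibrate
## the statistic against your own simulations instead:
D.obs <- ks.test(x, "pnorm", mean(x), sd(x))$statistic
D.sim <- replicate(2000, {
  y <- rnorm(length(x))
  ks.test(y, "pnorm", mean(y), sd(y))$statistic
})
mean(D.sim >= D.obs)                           # approximate Monte Carlo p-value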

The Anderson-Darling test also uses the difference between the CDF curves, like the KS test, but rather than taking the maximum difference it uses a function of the total area between the two curves: it squares the differences, weights them so that the tails have more influence, and integrates over the domain of the distribution. This gives more weight to outliers than KS does, and also gives more weight when there are several small differences (as opposed to the one big difference that KS would emphasize). That may end up making the test powerful enough to find differences you would consider unimportant (mild rounding, etc.). Like the KS test, this assumes that you did not estimate the parameters from the data.
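For a fully specified null, one implementation is ad.test in the goftest package (an assumption on my part that this package is available; nortest::ad.test, by contrast, only tests normality with parameters estimated from the data):

library(goftest)                               # install.packages("goftest") if needed
set.seed(1)
x <- rnorm(1000)                               # stand-in for your program's output
ad.test(x, "pnorm", mean = 0, sd = 1)          # AD test against a fully specified normal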

Here is a graph to show the general ideas behind the last two:

[Figure: three stacked panels: the EDF of a normal sample against the standard normal CDF with the KS distance marked in red; the difference between the two curves; and the squared, weighted difference whose area the AD test integrates.]

based on this R code:

set.seed(1)
tmp <- rnorm(25)

## Step-function version of the empirical CDF (EDF) for plotting
edf <- approxfun( sort(tmp), (0:24)/25, method='constant', 
    yleft=0, yright=1, f=1 )

par(mfrow=c(3,1), mar=c(4,4,0,0)+.1)

## Top panel: EDF vs. theoretical standard normal CDF
curve( edf, from=-3, to=3, n=1000, col='green' )
curve( pnorm, from=-3, to=3, col='blue', add=TRUE)

## Locate the largest vertical gap between the two curves
tmp.x <- seq(-3, 3, length=1000)
ediff <- function(x) pnorm(x) - edf(x)
m.x <- tmp.x[ which.max( abs( ediff(tmp.x) ) ) ]
ediff( m.x )  # signed difference there; its absolute value is the KS stat
segments( m.x, edf(m.x), m.x, pnorm(m.x), col='red' )  # mark the KS stat

## Middle panel: the difference between the two curves
curve( ediff, from=-3, to=3, n=1000 )
abline(h=0, col='lightgrey')    

## Bottom panel: squared difference with the AD weighting;
## the AD statistic is based on the area under this curve
ediff2 <- function(x) (pnorm(x) - edf(x))^2/( pnorm(x)*(1-pnorm(x)) )*dnorm(x)
curve( ediff2, from=-3, to=3, n=1000 )
abline(h=0)

The top graph shows the EDF of a sample from a standard normal compared to the CDF of the standard normal, with a red line marking the KS statistic. The middle graph shows the difference between the two curves (you can see where the KS statistic occurs). The bottom graph shows the squared, weighted difference; the AD test is based on the area under this curve (assuming I got everything correct).

Other tests look at the correlation in a Q-Q plot, look at the slope in the Q-Q plot, or compare the mean, variance, and other statistics based on the moments.
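As one concrete version of the Q-Q correlation idea (my own sketch, not part of the answer above, assuming a standard-normal null), correlate the sorted sample against the theoretical quantiles and calibrate the statistic by simulation to get a p-value:

set.seed(1)
x <- rnorm(200)                                # stand-in for your program's output
n <- length(x)
q.theory <- qnorm(ppoints(n))                  # theoretical quantiles
r.obs <- cor(sort(x), q.theory)                # observed Q-Q correlation
r.sim <- replicate(2000, cor(sort(rnorm(n)), q.theory))
mean(r.sim <= r.obs)                           # approximate Monte Carlo p-value (low correlation = bad fit)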
