Solved – How to measure the distance (or divergence – not sure) between data and a probability distribution

distance-functions, distributions, probability

If I have generated a set of random data and I wish to measure how well these data fit some distribution, e.g. a uniform probability distribution, what are the standard ways to do that?

I am not very experienced with this sort of data processing. One thing that comes to mind is Kullback-Leibler divergence, but there you measure the distance between two p.d.f.s. Here I have one p.d.f. and a set of data I would like to compare it to, which I can probably at best represent as a histogram. I have seen people mention the Kolmogorov-Smirnov statistic. Would that be suitable here?

Best Answer

There are many, many goodness of fit tests, including the Kolmogorov-Smirnov test you mentioned; many goodness of fit test statistics can be seen as a measure of discrepancy between the data and some distribution.

The most obvious ones to consider first are the empirical-CDF-based tests, of which the Kolmogorov-Smirnov test is the simplest.

The Kolmogorov-Smirnov test statistic is the largest vertical distance between the hypothesized cdf and the ECDF of the data (or sometimes a standardized version of it).
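
As a rough sketch of that statistic (only the distance itself, not a full test with a p-value), the snippet below computes the largest vertical gap between a hypothesized cdf and the ECDF in plain Python. The names `ks_statistic` and `uniform_cdf` are made up for illustration, and the standard uniform on [0, 1] is assumed as the hypothesized distribution.

```python
import random


def uniform_cdf(x):
    """cdf of the standard uniform distribution on [0, 1] (assumed hypothesis)."""
    return min(1.0, max(0.0, x))


def ks_statistic(data, cdf):
    """Largest vertical distance between the ECDF of `data` and `cdf`."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF steps from i/n up to (i + 1)/n at x, so the gap has to be
        # checked just below and just above each step.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


random.seed(0)
sample = [random.random() for _ in range(200)]
d = ks_statistic(sample, uniform_cdf)
# Multiplying by sqrt(n) is one common way of standardizing the distance.
print(d, d * len(sample) ** 0.5)
```

In practice you would more likely call a library routine such as `scipy.stats.kstest`, which also returns a p-value; the loop above is only meant to show what is being measured.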

There are many others based on the ECDF. One example is the Cramér-von Mises test, whose statistic (at least for the uniform) corresponds to a sum of squares of vertical distances between the two cdfs.
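
A minimal sketch of the Cramér-von Mises statistic, using the usual computational form 1/(12n) + Σ (F(x_(i)) − (2i − 1)/(2n))², where x_(i) are the sorted data. Again the function names are hypothetical and the standard uniform is assumed as the hypothesized distribution.

```python
import random


def uniform_cdf(x):
    """cdf of the standard uniform distribution on [0, 1] (assumed hypothesis)."""
    return min(1.0, max(0.0, x))


def cramer_von_mises(data, cdf):
    """Cramér-von Mises statistic of `data` against the hypothesized `cdf`."""
    xs = sorted(data)
    n = len(xs)
    total = 1.0 / (12 * n)
    for i, x in enumerate(xs, start=1):
        # Squared vertical distance between the hypothesized cdf at the i-th
        # order statistic and the midpoint of the ECDF step at that point.
        total += (cdf(x) - (2 * i - 1) / (2 * n)) ** 2
    return total


random.seed(0)
sample = [random.random() for _ in range(200)]
print(cramer_von_mises(sample, uniform_cdf))
```

Unlike the Kolmogorov-Smirnov distance, which looks only at the single worst gap, this accumulates the discrepancy over all the data points.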

Another related measure is the Anderson-Darling statistic, which weights the squared differences by the inverse of the variance of the ECDF and so puts more emphasis on the tails.
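
For completeness, here is a sketch of the Anderson-Darling statistic for a fully specified hypothesized distribution, using the standard computational form −n − (1/n) Σ (2i − 1)[ln F(x_(i)) + ln(1 − F(x_(n+1−i)))]. The names are hypothetical, the standard uniform is assumed, and the data are assumed to lie strictly inside (0, 1), since the logarithms blow up at the endpoints.

```python
import math
import random


def uniform_cdf(x):
    """cdf of the standard uniform distribution on [0, 1] (assumed hypothesis)."""
    return min(1.0, max(0.0, x))


def anderson_darling(data, cdf):
    """Anderson-Darling statistic of `data` against a fully specified `cdf`.

    Assumes 0 < cdf(x) < 1 for every observation; otherwise log(0) is hit.
    """
    # The cdf is monotone, so sorting the cdf values is the same as taking
    # the cdf of the sorted data.
    u = sorted(cdf(x) for x in data)
    n = len(u)
    s = 0.0
    for i in range(1, n + 1):
        # u[i - 1] is F(x_(i)); u[n - i] is F(x_(n + 1 - i)).
        s += (2 * i - 1) * (math.log(u[i - 1]) + math.log(1.0 - u[n - i]))
    return -n - s / n


random.seed(0)
sample = [random.random() for _ in range(200)]
print(anderson_darling(sample, uniform_cdf))
```

The log terms are what make observations near 0 or 1 count for much more than they do in the Cramér-von Mises sum.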

There are many other kinds of tests. One common one is the Shapiro-Wilk test, for example. The Shapiro-Wilk (and related tests) can also be treated as a measure of discrepancy between the distribution of the data and some hypothesized distribution (in the case of the Shapiro-Wilk test, that's the normal distribution). The Shapiro-Wilk, however, is invariant to changes in mean and variance.

A full coverage of the goodness of fit territory could take a book -- indeed has taken several.