Solved – How does one measure the non-uniformity of a distribution

Tags: distributions, random variable, uniform distribution, variance

I'm trying to come up with a metric for measuring non-uniformity of a distribution for an experiment I'm running. I have a random variable that should be uniformly distributed in most cases, and I'd like to be able to identify (and possibly measure the degree of) examples of data sets where the variable is not uniformly distributed within some margin.

An example of four data series, each with 10 measurements representing the frequency of occurrence of something I'm measuring, might look like this:

a: [10% 11% 10%  9%  9% 11% 10% 10% 12%  8%]
b: [10% 10% 10%  8% 10% 10%  9%  9% 12%  8%]
c: [ 3%  2% 60%  2%  3%  7%  6%  5%  5%  7%]   <-- non-uniform
d: [98% 97% 99% 98% 98% 96% 99% 96% 99% 98%]

I'd like to be able to distinguish distributions like c from those like a and b, and to measure c's deviation from a uniform distribution. Equivalently, if there's a metric for how uniform a distribution is (standard deviation close to zero?), I could perhaps use that to flag the series with high variance. However, my data may have just one or two outliers, like the c example above, and I'm not sure whether that would be easy to detect that way. A quick check of this idea is sketched below.
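A quick numeric check of the standard-deviation idea, using the example percentages above treated as plain numbers (just an illustration, not a formal test):

import numpy as np

series = {
    "a": [10, 11, 10, 9, 9, 11, 10, 10, 12, 8],
    "b": [10, 10, 10, 8, 10, 10, 9, 9, 12, 8],
    "c": [3, 2, 60, 2, 3, 7, 6, 5, 5, 7],
    "d": [98, 97, 99, 98, 98, 96, 99, 96, 99, 98],
}
for name, values in series.items():
    # a single large spike, as in c, already dominates the standard deviation
    print(name, round(np.std(values), 2))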

I can hack something together in software, but I'm looking for statistical methods or approaches that justify this formally. I took a class years ago, but stats is not my area. This seems like something that should have a well-known approach. Sorry if any of this is completely bone-headed. Thanks in advance!

Best Answer

If you have not only the frequencies but the actual counts, you can use a $\chi^2$ goodness-of-fit test for each data series. In particular, you wish to use the test for a discrete uniform distribution. This gives you a good test, which allows you to find out which data series are likely not to have been generated by a uniform distribution, but does not provide a measure of uniformity.
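As a minimal sketch, assuming raw counts are available (the counts below are made up to resemble series c, scaled to a total of 1000 observations), SciPy's chisquare tests observed counts against a discrete uniform distribution by default:

from scipy.stats import chisquare

# hypothetical counts resembling series c, out of 1000 observations
observed = [30, 20, 600, 20, 30, 70, 60, 50, 50, 70]
stat, p_value = chisquare(observed)  # expected frequencies default to uniform
print(stat, p_value)  # a tiny p-value is strong evidence against uniformity

Series like a and b would give large p-values, so they would not be flagged.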

There are other possible approaches, such as computing the entropy of each series: among distributions over a fixed number of categories, the uniform distribution maximizes the entropy, so a suspiciously low entropy suggests that the data are probably not uniform. In that sense, entropy works as a measure of uniformity.
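A sketch of that idea, assuming each series is given as a vector of frequencies; dividing by log(k), the maximum possible entropy over k categories, gives a score between 0 and 1, with 1 meaning perfectly uniform:

import numpy as np

def normalized_entropy(freqs):
    # Shannon entropy of a frequency vector, scaled so that 1 = perfectly uniform
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum()          # renormalize in case the values don't sum to exactly 1
    p = p[p > 0]             # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p)) / np.log(len(freqs))

a = [0.10, 0.11, 0.10, 0.09, 0.09, 0.11, 0.10, 0.10, 0.12, 0.08]
c = [0.03, 0.02, 0.60, 0.02, 0.03, 0.07, 0.06, 0.05, 0.05, 0.07]
print(normalized_entropy(a))  # close to 1
print(normalized_entropy(c))  # noticeably lower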

Another suggestion is the Kullback-Leibler divergence, which measures how much one distribution diverges from a reference distribution (here, the discrete uniform); it is zero when the two match and grows as they differ.
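A sketch along those lines, treating a series as a probability vector and comparing it to the discrete uniform distribution; scipy.stats.entropy(p, q) returns the KL divergence D(p || q):

import numpy as np
from scipy.stats import entropy  # entropy(p, q) is the KL divergence D(p || q)

c = np.array([0.03, 0.02, 0.60, 0.02, 0.03, 0.07, 0.06, 0.05, 0.05, 0.07])
uniform = np.full(len(c), 1.0 / len(c))
print(entropy(c, uniform))  # 0 for a uniform series, larger as c departs from uniformity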
