Solved – How does one measure the non-uniformity of a distribution

Tags: distributions, random variable, uniform distribution, variance

I'm trying to come up with a metric for measuring non-uniformity of a distribution for an experiment I'm running. I have a random variable that should be uniformly distributed in most cases, and I'd like to be able to identify (and possibly measure the degree of) examples of data sets where the variable is not uniformly distributed within some margin.

An example of four data series, each with 10 measurements representing the frequency of occurrence of something I'm measuring, might look like this:

a: [10% 11% 10%  9%  9% 11% 10% 10% 12%  8%]
b: [10% 10% 10%  8% 10% 10%  9%  9% 12%  8%]
c: [ 3%  2% 60%  2%  3%  7%  6%  5%  5%  7%]   <-- non-uniform
d: [98% 97% 99% 98% 98% 96% 99% 96% 99% 98%]

I'd like to be able to distinguish distributions like c from those like a and b, and to measure c's deviation from a uniform distribution. Equivalently, if there's a metric for how uniform a distribution is (standard deviation close to zero?), I could perhaps use that to flag the series with high variance. However, my data may have just one or two outliers, like the c example above, and I'm not sure whether that would be easy to detect that way. A quick check of this idea is sketched below.
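A quick numeric check of the standard-deviation idea, using the example percentages above treated as plain numbers (just an illustration, not a formal test):

import numpy as np

series = {
    "a": [10, 11, 10, 9, 9, 11, 10, 10, 12, 8],
    "b": [10, 10, 10, 8, 10, 10, 9, 9, 12, 8],
    "c": [3, 2, 60, 2, 3, 7, 6, 5, 5, 7],
    "d": [98, 97, 99, 98, 98, 96, 99, 96, 99, 98],
}
for name, values in series.items():
    # a single large spike, as in c, already dominates the standard deviation
    print(name, round(np.std(values), 2))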

I can hack something together in software, but I'm looking for statistical methods or approaches that justify this formally. I took a class years ago, but stats is not my area. This seems like something that should have a well-known approach. Sorry if any of this is completely bone-headed. Thanks in advance!

Best Answer

If you have not only the frequencies but the actual counts, you can use a $\chi^2$ goodness-of-fit test for each data series. In particular, you wish to use the test for a discrete uniform distribution. This gives you a good test, which allows you to find out which data series are likely not to have been generated by a uniform distribution, but does not provide a measure of uniformity.
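As a minimal sketch, assuming raw counts are available (the counts below are made up to resemble series c, scaled to a total of 1000 observations), SciPy's chisquare tests observed counts against a discrete uniform distribution by default:

from scipy.stats import chisquare

# hypothetical counts resembling series c, out of 1000 observations
observed = [30, 20, 600, 20, 30, 70, 60, 50, 50, 70]
stat, p_value = chisquare(observed)  # expected frequencies default to uniform
print(stat, p_value)  # a tiny p-value is strong evidence against uniformity

Series like a and b would give large p-values, so they would not be flagged.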

There are other possible approaches, such as computing the entropy of each series: among distributions over a fixed number of categories, the uniform distribution maximizes the entropy, so a suspiciously low entropy suggests that the data are probably not uniform. In that sense, entropy works as a measure of uniformity.
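A sketch of that idea, assuming each series is given as a vector of frequencies; dividing by log(k), the maximum possible entropy over k categories, gives a score between 0 and 1, with 1 meaning perfectly uniform:

import numpy as np

def normalized_entropy(freqs):
    # Shannon entropy of a frequency vector, scaled so that 1 = perfectly uniform
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum()          # renormalize in case the values don't sum to exactly 1
    p = p[p > 0]             # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p)) / np.log(len(freqs))

a = [0.10, 0.11, 0.10, 0.09, 0.09, 0.11, 0.10, 0.10, 0.12, 0.08]
c = [0.03, 0.02, 0.60, 0.02, 0.03, 0.07, 0.06, 0.05, 0.05, 0.07]
print(normalized_entropy(a))  # close to 1
print(normalized_entropy(c))  # noticeably lower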

Another suggestion is the Kullback-Leibler divergence, which measures how much one distribution diverges from a reference distribution (here, the discrete uniform); it is zero when the two match and grows as they differ.
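A sketch along those lines, treating a series as a probability vector and comparing it to the discrete uniform distribution; scipy.stats.entropy(p, q) returns the KL divergence D(p || q):

import numpy as np
from scipy.stats import entropy  # entropy(p, q) is the KL divergence D(p || q)

c = np.array([0.03, 0.02, 0.60, 0.02, 0.03, 0.07, 0.06, 0.05, 0.05, 0.07])
uniform = np.full(len(c), 1.0 / len(c))
print(entropy(c, uniform))  # 0 for a uniform series, larger as c departs from uniformity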
