Solved – Measure for the uniformity of a distribution

distributions, measurement, random variable, uniform distribution

I can't seem to find a well-established and simple statistical measure of uniformity for occurrence datasets in the presence of zero-valued categories. I've looked at Shannon's entropy, which seems to be the closest, but the problem is that it can't tell me anything about variables which have no occurrences in them.

I always have a set number of variables, e.g. 5, each consisting of 0–10 occurrences.
Although entropy can tell me about the distribution across all variables when they all have more than zero occurrences, in the presence of zeros it becomes somewhat meaningless.

I had a look at How does one measure the non-uniformity of a distribution? but I believe this is a different case.

In my case I have a minimum and maximum number of occurrences that can occur in each category (0–10) and a fixed number of these categories (5). Can anyone point me in the right direction?

Example:

10 10 10 10 10 
H(X) = 2.32193 and metric entropy = 0.04644

Ideally, a metric would give me an extreme value of 1 for uniformity.

10  0  0  0  0 
H(X) = 0 and metric entropy = 0

Ideally, a metric would give me an extreme value of 0 (or near it) for non-uniformity.

10 5  0  0  0
H(X) = 0.9183 and metric entropy = 0.06122
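
For reference, a minimal Python sketch that reproduces these figures; I'm taking "metric entropy" to mean $H(X)$ divided by the total number of occurrences, which matches the values above.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy in bits; zero counts are skipped (0 log 0 := 0)."""
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c)

for counts in ([10, 10, 10, 10, 10], [10, 0, 0, 0, 0], [10, 5, 0, 0, 0]):
    h = shannon_entropy(counts)
    print(counts, round(h, 5), round(h / sum(counts), 5))
# [10, 10, 10, 10, 10] 2.32193 0.04644
# [10, 0, 0, 0, 0] 0.0 0.0
# [10, 5, 0, 0, 0] 0.9183 0.06122
```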

Many thanks, and sorry for my lack of statistical knowledge.

Best Answer

First, note that your terminology is inconsistent. Here I take it that you have one variable (not several) consisting of a fixed number of categories, and that you are concerned with how categories with zero frequency or probability (not value) are handled.

Your $H$ is evidently $\sum_i p_i \log_2 (1/p_i)$ for probabilities or proportions $p_i$. The base used for logarithms does not affect any key principle here, so we can think of it as summing terms $p_i \log (1/p_i) = -p_i \log p_i$.

The counter-argument to your worry is that entropy does take into account categories that have zero probability; it is just that they contribute zero to the entropy, given the strong convention that $-0 \log 0$ is evaluated as 0. A more informal version of the same argument is that the diversity or non-uniformity of what you do have in your collection is unaffected by what you don't have. If I have 10 elephants, spelling out that I have 0 giraffes, or do not have any giraffes, is incidental: what I have are 10 elephants. Any other statement about 0 frequencies adds no information (literally).
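
To see this concretely, here is a small sketch (my own illustration, assuming nothing beyond the convention just stated): appending zero-frequency categories leaves the entropy unchanged.

```python
import math

def H(counts):
    """Entropy in bits; zero counts are skipped, i.e. 0 log 0 := 0."""
    n = sum(counts)
    return sum(c / n * math.log2(n / c) for c in counts if c)

print(H([10]))           # 0.0 -- ten elephants, nothing else
print(H([10, 0]))        # 0.0 -- declaring zero giraffes adds nothing
print(H([10, 5]))        # 0.9183...
print(H([10, 5, 0, 0]))  # identical: the zeros contribute zero terms
```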

The same question of how to handle zero proportions arises with any measure. An alternative to entropy is based on summing squared probabilities, $\sum_i p_i^2$, and with such measures there is the same consequence: any $p_i$ that is 0 makes no difference to the sum.
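
The same point in code, for the squared-probability measure (a sketch; $\sum_i p_i^2$ is the quantity often called the Simpson or Herfindahl index):

```python
def sum_squared_probs(counts):
    """Sum of squared proportions; zero counts contribute 0^2 = 0."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)

print(sum_squared_probs([10, 5]))        # 0.5555...
print(sum_squared_probs([10, 5, 0, 0]))  # identical: zeros add nothing
```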

You touch on a much more general issue: what can be inferred about a distribution from a summary measure. Any single summary measure is an irreversible reduction; you can't go back from it to the distribution unequivocally. This is on all fours with the point made in elementary statistics that the same mean or correlation can reflect quite different datasets.

I suspect that the main issue here is that you are seeking a way to make entropy more intuitive, and that is a legitimate concern. An easy way is to talk in terms of the "numbers equivalent". Calculate $2^H$ for your examples and you recover 5 for 10, 10, 10, 10, 10 and 1 for 10, 0, 0, 0, 0; these can be interpreted as the equivalent number of (equally common) categories present. For other examples the result will be a non-integer, which is reasonable. If logarithms are taken to base 10 or base $e$, use $10^H$ or $\exp(H)$ to get the numbers equivalent.
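
A quick check of the numbers equivalent on the question's examples (sketch only):

```python
import math

def H(counts):
    """Entropy in bits, skipping zero counts."""
    n = sum(counts)
    return sum(c / n * math.log2(n / c) for c in counts if c)

for counts in ([10, 10, 10, 10, 10], [10, 0, 0, 0, 0], [10, 5, 0, 0, 0]):
    print(counts, round(2 ** H(counts), 4))
# [10, 10, 10, 10, 10] 5.0    -- five equally common categories
# [10, 0, 0, 0, 0] 1.0        -- effectively a single category
# [10, 5, 0, 0, 0] 1.8899     -- a non-integer in between, as expected
```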

P.S. I try to avoid asserting that something is meaningless unless I am totally sure that it is. I have found too often that I just didn't understand the argument.

EDIT 2016: If you know that (e.g.) 4 and only 4 categories are possible in principle, but only 3 occur, then that's pertinent information. Sometimes you know this: e.g. if cards can be $\{$spades, hearts, clubs, diamonds$\}$ and only some of those kinds occur, that's something to cite.

A measure of diversity that does take zeros into consideration, and is affected by whether zeros occur, goes under various names (e.g. dissimilarity index) and has the general form $D := (1/2) \sum_{i=1}^S |p_i - q_i|$. Here $p_i$ is the observed proportion of category $i$ and $q_i$ is the proportion in a reference distribution, e.g. equal probabilities $q_i = 1/S$. The minimum occurs when the observed distribution is identical to the reference distribution, giving $D = 0$. The maximum occurs when one proportion $p_i$ is $1$ and the others are all zero; the achievable maximum (for a uniform reference it is $1 - 1/S$) depends on the number of categories $S$, which after all is part of the information. The concrete interpretation of $D$ is the minimum proportion of observations that would need to change categories to reproduce the reference distribution.
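
A minimal sketch of $D$ with a uniform reference $q_i = 1/S$, applied to the question's examples:

```python
def dissimilarity(counts, ref=None):
    """D = (1/2) * sum_i |p_i - q_i|; default reference is uniform, q_i = 1/S."""
    n, S = sum(counts), len(counts)
    q = ref if ref is not None else [1 / S] * S
    return 0.5 * sum(abs(c / n - qi) for c, qi in zip(counts, q))

print(dissimilarity([10, 10, 10, 10, 10]))  # 0.0 -- identical to the reference
print(dissimilarity([10, 0, 0, 0, 0]))      # 0.8 -- the maximum 1 - 1/S for S = 5
print(dissimilarity([10, 5, 0, 0, 0]))      # 0.6 -- 60% would have to change category
```

Note that $D$ runs in the opposite direction to what was asked for (0 for uniformity, large for concentration), so $1 - D$, or a rescaling by the achievable maximum $1 - 1/S$, could be used if a 0-to-1 uniformity score is wanted.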

Another example of a reference distribution would be the national distribution of socio-economic classes or ethnic categories. Then $D = 0$ would mean that a local or regional community is a microcosm of the nation, and otherwise $D$ measures the departure from it in some direction.