Solved – Interpreting Shannon entropy

data-visualization, entropy, interpretation

From a computer simulation I have built a histogram of the results and normalized it so that the bin probabilities sum to one, $\sum_j P(X \in b_j) = 1$, where $P(X \in b_j)$ is the probability of finding a point $X$ in bin $b_j$.
From this I have calculated the histogram's Shannon entropy, $H = -\sum_j P(X \in b_j) \log P(X \in b_j)$, in order to have some way to quantify the "predictivity" of $P$.
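
For concreteness, here is a minimal sketch of that computation (the Gaussian samples and the 50-bin choice are placeholders for whatever the simulation actually produces):

```python
import numpy as np

# Placeholder for the simulation output.
samples = np.random.randn(10_000)

# Histogram and normalization: the bin probabilities sum to one.
counts, edges = np.histogram(samples, bins=50)
p = counts / counts.sum()

# Shannon entropy in bits, with the convention 0 * log(0) = 0
# (empty bins are simply skipped).
p_nz = p[p > 0]
H = -np.sum(p_nz * np.log2(p_nz))
print(f"H = {H:.3f} bits")
```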

Now, while I get a number easily enough, I'm having a hard time understanding what I should do with it. My first thought was to compare $H$ for $P$ against $H$ for the uniform distribution over the same $X$-range, since the uniform distribution has the maximal entropy (we know $X$ must belong to a finite range). Or I could compare the $X$-range to some "effective volume" $\Delta X$, where $\Delta X$ is the width of the interval on which a uniform distribution with the same $H$ as my histogram would be defined. I freely admit these aren't wonderful comparisons, since my histograms don't look at all like uniform distributions.
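
Both comparisons are straightforward to compute: a uniform distribution over $n$ bins has entropy $\log_2 n$, so $H / \log_2 n$ is a normalized entropy in $[0, 1]$, and inverting the relation gives $2^H$ "effective bins", i.e. an effective volume $\Delta X = 2^H \times (\text{bin width})$. A sketch (the function name and interface here are mine, not part of the question):

```python
import numpy as np

def entropy_report(p, bin_width):
    """Compare a histogram's entropy against the uniform benchmark.

    p         -- normalized bin probabilities (they sum to one)
    bin_width -- common width of the bins
    """
    p = np.asarray(p, dtype=float)
    p_nz = p[p > 0]
    H = -np.sum(p_nz * np.log2(p_nz))            # entropy in bits
    H_max = np.log2(p.size)                      # uniform over the same bins
    return {
        "H": H,
        "H_uniform": H_max,
        "H_normalized": H / H_max,               # in [0, 1]
        "effective_volume": 2.0**H * bin_width,  # the Delta-X idea
    }
```

With the `p` and `edges` from the previous snippet, `entropy_report(p, edges[1] - edges[0])` reports the normalized entropy and the effective volume in one place.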

I work in a field that does not regularly use $H$ as a statistic, so I can't just give my reader a number and be done with it. However, I know it's a valuable quantity for my histogram. My question is: How would you report, describe, and compare the Shannon entropy for experimental/simulated histograms?

Best Answer

It depends on what you want to show and on what kind of variable you have:

  • categorical variable - it's fine
  • discrete but ordinal variable - it's a bit tricky
    • e.g. on a 1-5 scale it is something quite different to put the same probabilities on 1 and 5 as on 3 and 4, even though the entropy is the same
  • continuous variable - it's even trickier
    • the previous argument still applies
    • the choice of coordinates matters (good coordinates are ones respecting the symmetries, and they do not always exist)
    • changing the bin width shifts the entropy (see the sketch after this list)

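To illustrate the last point: for a reasonably smooth distribution, halving the bin width adds roughly one bit of entropy, so the raw number reflects an arbitrary binning choice as much as the distribution itself. A quick sketch (the Gaussian sample is arbitrary):

```python
import numpy as np

def hist_entropy_bits(samples, bins):
    counts, _ = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

samples = np.random.randn(100_000)

# Doubling the number of bins raises the entropy by about one bit.
for bins in (25, 50, 100):
    print(bins, round(hist_entropy_bits(samples, bins), 3))
```
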
So I will mostly focus on the categorical case.

A typical quantity you can use is the Kullback-Leibler divergence, which measures how different your probability distribution $Y$ is from some reference distribution $X$:

$$ D_{KL}(Y||X) = \sum_x P(Y=x) \log \left(\frac{P(Y=x)}{P(X=x)} \right) $$

It can be interpreted as information gain: how much information you gained when, expecting probability distribution $X$, you instead measured distribution $Y$. If $X$ is uniform, then the KL divergence is just the entropy of $X$ minus the entropy of $Y$.
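
In code the formula is a direct transcription; a sketch, with distributions represented as arrays of probabilities over the same outcomes (the convention $0 \log 0 = 0$ is handled by masking):

```python
import numpy as np

def kl_divergence_bits(p_y, p_x):
    """D_KL(Y || X) in bits; assumes p_x > 0 wherever p_y > 0."""
    p_y = np.asarray(p_y, dtype=float)
    p_x = np.asarray(p_x, dtype=float)
    mask = p_y > 0                    # terms with P(Y=x) = 0 contribute 0
    return np.sum(p_y[mask] * np.log2(p_y[mask] / p_x[mask]))

# The coin example below: expecting a fair coin, observing certain heads.
print(kl_divergence_bits([1.0, 0.0], [0.5, 0.5]))  # -> 1.0 bit
```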

As an example, when you expect a coin to be fair, $X=(\tfrac{1}{2}, \tfrac{1}{2})$, then toss it and get heads with certainty, $Y=(1,0)$, you learn exactly one bit of information.

When it comes to choosing the "uninformed" probability distribution, it depends on the problem. In the discrete case, just take the maximum-entropy distribution given the constraints. If there are no constraints, it is simply the uniform distribution. For linear constraints (that is, when some averages are fixed) there is a simple recipe to compute such a distribution.
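
That recipe is the Gibbs (exponential-family) construction: with a fixed average $\langle x \rangle$, the maximum-entropy distribution has the form $p_i \propto e^{\lambda x_i}$, with $\lambda$ chosen so the constraint holds. A sketch using SciPy's root finder (the six-sided die and the target mean of 4.5 are purely illustrative):

```python
import numpy as np
from scipy.optimize import brentq

def maxent_with_mean(values, target_mean):
    """Maximum-entropy distribution over `values` with a fixed mean.

    The solution has Gibbs form p_i ~ exp(lam * x_i); the Lagrange
    multiplier lam is found numerically.
    """
    x = np.asarray(values, dtype=float)

    def mean_error(lam):
        w = np.exp(lam * (x - x.mean()))   # centering for numerical stability
        p = w / w.sum()
        return np.sum(p * x) - target_mean

    lam = brentq(mean_error, -50.0, 50.0)
    w = np.exp(lam * (x - x.mean()))
    return w / w.sum()

# E.g. a six-sided die constrained to average 4.5 instead of 3.5.
print(np.round(maxent_with_mean([1, 2, 3, 4, 5, 6], 4.5), 4))
```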

If there are a few different models, you can compare them by measuring each against the same $X$. The same works for ad hoc assumptions (for example, uniform on some set and zero elsewhere).

If you need to normalize it, divide by the entropy of the uninformed probability distribution $X$.

EDIT:

If you just want to convey how concentrated the distribution is, use the entropy of $Y$ alone (comparing it to the entropy of $X$). In this case, lower means more concentrated (more predictive).
