Solved – Statistical interpretation of Maximum Entropy Distribution

distributions, entropy, information theory, intuition, maximum-entropy

I have used the principle of maximum entropy to justify the use of several distributions in various settings; however, I have yet to be able to formulate a statistical, as opposed to information-theoretic, interpretation of maximum entropy. In other words, what does maximizing the entropy imply about the statistical properties of the distribution?

Has anyone come across, or perhaps derived yourself, a statistical interpretation of maximum entropy distributions that does not appeal to information, but only to probabilistic concepts?

As an example of such an interpretation (not necessarily true): "For an interval of arbitrary length $L$ on the domain of the RV (assuming it is one-dimensional and continuous, for simplicity), the maximum probability that can be contained in this interval is minimized by the maximum entropy distribution."

So, you see there is no talk about "informativeness" or other more philosophical ideas, just probabilistic implications.

Best Answer

This isn't really my field, so some musings:

I will start with the concept of surprise. What does it mean to be surprised? Usually, it means that something happened that was not expected to happen. So, surprise is a probabilistic concept and can be explicated as such (I. J. Good has written about that). See also Wikipedia and Bayesian Surprise.

Take the particular case of a yes/no situation: something can happen or not, and it happens with probability $p$. If $p=0.9$ and it happens, you are not really surprised. If $p=0.05$ and it happens, you are somewhat surprised. And if $p=0.0000001$ and it happens, you are really surprised. So, a natural measure of the "surprise value of an observed outcome" is some (anti)monotone function of the probability of what happened. It seems natural (and works well ...) to take the logarithm of the probability of what happened, and then we throw in a minus sign to get a positive number. Also, by taking the logarithm we concentrate on the order of magnitude of the surprise, and, in practice, probabilities are often only known up to order of magnitude, more or less.
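A quick numerical illustration of that ordering (a minimal sketch in Python, using the example probabilities above and natural logarithms, so the surprise is measured in nats):

```python
# Minimal sketch: surprise as -log(probability), in nats,
# for the example probabilities mentioned above.
import math

for p in (0.9, 0.05, 0.0000001):
    print(f"p = {p:<8}  surprise = {-math.log(p):6.2f} nats")
```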

So, we define $$ \text{Surprise}(A) = -\log p(A) $$ where $A$ is the observed outcome, and $p(A)$ is its probability.

Now we can ask what the expected surprise is. Let $X$ be a Bernoulli random variable with probability $p$. It has two possible outcomes, 0 and 1. The respective surprise values are $$\begin{align} \text{Surprise}(0) &= -\log(1-p) \\ \text{Surprise}(1) &= -\log p \end{align} $$ so the surprise when observing $X$ is itself a random variable, with expectation $$ -p \log p - (1-p) \log(1-p) $$ and that is --- surprise! --- the entropy of $X$! So entropy is expected surprise!
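A small sketch of that computation in Python (the particular values of $p$ below are my own, chosen only for illustration):

```python
# Minimal sketch: entropy of a Bernoulli(p) variable computed as expected surprise.
import math

def surprise(prob):
    """Surprise of observing an outcome that had probability `prob`, in nats."""
    return -math.log(prob)

def bernoulli_entropy(p):
    """Expected surprise: p * Surprise(1) + (1 - p) * Surprise(0)."""
    return p * surprise(p) + (1 - p) * surprise(1 - p)

print(bernoulli_entropy(0.5))  # log(2) ≈ 0.693, the maximum
print(bernoulli_entropy(0.9))  # ≈ 0.325, less surprising on average
```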

Now, this question is about maximum entropy. Why would anybody want to use a maximum entropy distribution? Well, it must be because they want to be maximally surprised! Why would anybody want that?

A way to look at it is the following: you want to learn about something, and to that end you set up some learning experiences (or experiments ...). If you already know everything about the topic, you can always predict perfectly, so you are never surprised. Then you never get any new experience, so you do not learn anything new (but you already know everything---there is nothing left to learn, so that is OK). In the more typical situation, where you are confused and not able to predict perfectly, there is a learning opportunity! This leads to the idea that we can measure the "amount of possible learning" by the expected surprise, that is, by the entropy. So, maximizing entropy is nothing other than maximizing the opportunity for learning. That sounds like a useful concept, for instance when designing experiments and such things.

A poetic example is the well-known

Wenn einer eine Reise macht, dann kann er was erzählen ... ("When someone goes on a journey, they have a story to tell ...")

One practical example: you want to design a system for online tests (online meaning that not everybody gets the same questions; the questions are chosen dynamically, depending on previous answers, and so are optimized, in some way, for each person).

If you make the questions too difficult, so that they are never answered correctly, you learn nothing; that indicates you must lower the difficulty level. What is the optimal difficulty level, that is, the difficulty level that maximizes the rate of learning? Let the probability of a correct answer be $p$. We want the value of $p$ that maximizes the Bernoulli entropy, and that is $p=0.5$. So you aim to pose questions where the probability of a correct answer (from that person) is 0.5.
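A quick numerical check of that claim (a sketch; the grid resolution is arbitrary):

```python
# Minimal sketch: the Bernoulli entropy -p*log(p) - (1-p)*log(1-p) peaks at p = 0.5.
import math

def bernoulli_entropy(p):
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]   # avoid p = 0 and p = 1
print(max(grid, key=bernoulli_entropy))     # 0.5
```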

Now consider the case of a continuous random variable $X$. How can we be surprised by observing $X$? The probability of any particular outcome $\{X=x\}$ is zero, so the $-\log p$ definition is useless. But we will be surprised if the probability of observing something like $x$ is small, that is, if the density value $f(x)$ is small (assuming $f$ is continuous). That leads to the definition $$ \DeclareMathOperator{\E}{\mathbb{E}} \text{Surprise}(x) = -\log f(x) $$ With that definition, the expected surprise from observing $X$ is $$ \E \{-\log f(X)\} = -\int f(x) \log f(x) \; dx $$ that is, the expected surprise from observing $X$ is the differential entropy of $X$. It can also be seen as the expected negative log-likelihood.
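To make the "expected negative log-likelihood" reading concrete, here is a sketch that estimates the differential entropy of a standard normal by Monte Carlo (the choice of distribution, seed, and sample size are my own, purely for illustration):

```python
# Minimal sketch: differential entropy as E[-log f(X)], estimated by Monte Carlo
# for a standard normal X (an assumed example distribution).
import math
import random

def log_density_std_normal(x):
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)

random.seed(0)
n = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
estimate = sum(-log_density_std_normal(x) for x in samples) / n

exact = 0.5 * math.log(2 * math.pi * math.e)  # ≈ 1.419 nats
print(estimate, exact)
```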

But this isn't really the same as the first, discrete, case. To see that, consider an example. Let the random variable $X$ represent the length of a throw of a stone (say in a sports competition). To measure that length we need to choose a length unit, since there is no intrinsic scale for length, as there is for probability. We could measure in mm or in km, or, more usually, in meters. But our definition of surprise, and hence of expected surprise, depends on the unit chosen, so there is no invariance. For that reason, the values of differential entropy are not directly comparable the way that Shannon entropy is. It can still be useful, as long as one remembers this problem.
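The unit dependence can be seen directly in a sketch: rescaling the same lengths from meters to millimeters (a factor of 1000, with a made-up normal model for the throws) shifts the estimated differential entropy by exactly $\log 1000$ nats:

```python
# Minimal sketch: differential entropy is not unit-free.  Measuring the same
# lengths in millimeters instead of meters adds log(1000) nats.
import math
import random

def mc_entropy(samples, log_density):
    """Monte Carlo estimate of E[-log f(X)]."""
    return sum(-log_density(x) for x in samples) / len(samples)

def log_normal_density(mu, sigma):
    return lambda x: (-0.5 * ((x - mu) / sigma) ** 2
                      - math.log(sigma * math.sqrt(2 * math.pi)))

# Assumed example: throw lengths ~ Normal(mean 20 m, sd 2 m).
random.seed(0)
meters = [random.gauss(20.0, 2.0) for _ in range(100_000)]
millimeters = [1000.0 * x for x in meters]

h_m = mc_entropy(meters, log_normal_density(20.0, 2.0))
h_mm = mc_entropy(millimeters, log_normal_density(20000.0, 2000.0))
print(h_mm - h_m, math.log(1000))  # both ≈ 6.91
```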
