Probability – Do Mixture Models Defy Entropy Principles?

entropymixture-distributionnormal distributionprobability

Recently, I have learned about the principle of Maximum Entropy with regards to Probability Distribution (https://www.youtube.com/watch?v=2gTrsLVnp9c) – in particular, when certain "information" (i.e. constraints) is available about some class of probability distribution function (e.g. domain over which the probability function is defined, expectation, etc.), we can use the principle of Maximum Entropy to determine the "most informative" probability distribution function from this class of probability distribution functions in this situation.

Apparently, in many real world situations (e.g. when the data is continuous and can take any value between negative infinity and positive infinity) – the Normal Distribution ends up being the probability distribution function with the Maximum Entropy, thus often resulting in the "most informative" choice of probability distribution function when compared to any other candidate.

However, by using the logic from the video : there any many real world situations in which the principle of Maximum Entropy would suggest that the Normal Distribution would be the most informative probability distribution function to use – but in reality, a "non normal" probability distribution function ends up being a better choice. In my opinion, a clear example of this are "Mixture Distributions" (e.g. Mixture Models). For example, there might be some instances in which the data might several distinct and "latent" groups and each of these groups might come from a distinct (normal) distribution – thus a (correctly specified) mixture distribution would be able to find some combination of different distributions and possibly result in a "better model" than a single normal distribution (where the normal distribution was the suggested candidate via the Maximum Entropy Principle). I could see this being the case even if we did not know apriori that our data contains several distinct latent groups.

My Question: Do Mixture Distributions defy the principle of Maximum Entropy – or could the principle of Maximum Entropy somehow still be used in such a context to suggest that Mixture Distributions are still the most "informative" distributions?

Best Answer

we can use the principle of Maximum Entropy to determine the "most informative" probability distribution function from this class of probability distribution functions in this situation.

More entropy is less information.

However, by using the logic from the video : there any many real world situations in which the principle of Maximum Entropy would suggest that the Normal Distribution would be the most informative probability distribution function to use - but in reality, a "non normal" probability distribution function ends up being a better choice.

If you choose a normal distribution as solution to some problem, but something else turns out the be the real answer, then possibly you did not have the correct information to solve the problem or you made the analysis incorrectly.

'being a better choice' is not a contradiction because you measure it differently in the two different cases.

Related Solutions

Outlier Detection – How to Detect Outliers in a Mixture of Gaussians Using Normal Distribution Models

I have suggested, in comments, that an "outlier" in this situation might be defined as a member of a "small" cluster centered at an "extreme" value. The meanings of the quoted terms need to be quantified, but apparently they can be: "small" would be a cluster of less than 10 values and "extreme" can be determined as outlying relative to the set of component means in the mixture model. In this case, outliers can be found with simple post-processing of any reasonable cluster analysis of the data.

Choices have to be made in fine-tuning this approach. These choices will depend on the nature of the data and therefore cannot be completely specified in a general answer like this. Instead, let's analyze some data. I use R due to its popularity on this site and succinctness (even compared to Python).

First, create some data as described in the question:

set.seed(17) # For reproducible results
centers <- rnorm(100, mean=100, sd=20)
x <- c(centers + rnorm(100*100, mean=0, sd=1), 
       rnorm(100, mean=250, sd=1), 
       rnorm(9, mean=300, sd=1))

This command specifies 102 components: 100 of them are situated like 100 independent draws from a normal(100, 20) distribution (and will therefore tend to lie between 50 and 150); one of them is centered at 250, and one is centered at 300. It then draws 100 values independently from each component (using a common standard deviation of 1) but, in the last component centered at 300, it draws only 9 values. According to the characterization of outliers, the 100 values centered at 250 do not constitute outliers: they should be viewed as a component of the mixture, albeit situated far from the others. However, one cluster of nine high values consists entirely of outliers. We need to detect these but no others.

Most omnibus univariate outlier-detection procedures would either not detect any of these 109 highest values or would indicate all 109 are outliers.

Suppose we have a good sense of the standard deviations of the components (obtained from prior information or from exploring the data). Use this to construct a kernel density estimate of the mixture:

d <- density(x, bw=1, n=1000)
plot(d, main="Kernel density")

KDE

The (almost invisible) blip at the extreme right qualifies as a set of outliers: its small area (less than 10/10109 = 0.001 of the total) indicates it consists of just a few values and its situation at one extreme of the x-axis earns it the appellation of "outlier" rather than "inlier." Checking these things is straightforward:

x0 <- d$x[d$y > 1000/length(x) * dnorm(5)]
gaps <- tail(x0, -1) - head(x0, -1)
histogram(gaps, main="Gap Counts")

Gap histogram

The density estimate d is represented by a 1D grid of 1000 bins. These commands have retained all bins in which the density is sufficiently large. For "large" I chose a very small value, to make sure that even the density of a single isolated value is picked up, but not so small that obviously separated components are merged.

Evidently the gap distribution has two high outliers (which can automatically be detected using any simple procedure, even an ad hoc one). One characterization is that they both exceed 25 (in this example). Let's find the values associated with them:

large.gaps <- gaps > 25
ranges <- rbind(tail(x0,-1)[large.gaps], c(tail(head(x0,-1)[large.gaps], -1), max(x))

The output is

         [,1]     [,2]
[1,] 243.9937 295.7732
[2,] 256.3758 300.9340

Within the range of data (from 25 to 301) these gaps determine two potential outlying ranges, one from 244 to 256 (column 1) and another from 296 to 301 (column 2). Let's see how many values lie within these ranges:

lapply(apply(ranges, 2, function(r){x[r[1] <= x & x <= r[2]]}), length)

The result is

[[1]]
[1] 100

[[2]]
[1] 9

The 100 is too large to be unusual: that's one of the components of the mixture. But the 9 is small enough. It remains to see whether any of these components might be considered outlying (as opposed to inlying):

apply(ranges, 2, mean)

The result is

[1] 250.1848 298.3536

The center of the 100-point cluster is at 250 and the center of the 9-point cluster is at 298, far enough from the rest of the data to constitute a cluster of outliers. We conclude there are nine outliers. Specifically, these are the values determined by column 2 of ranges,

x[ranges[1,2] <= x & x <= ranges[2,2]]

In order, they are

299.0379 300.0376 300.2696 300.3892 300.4250 300.5659 300.7018 300.8436 300.9340

Solved – Statistical interpretation of Maximum Entropy Distribution

This isn't really my field, so some musings:

I will start with the concept of surprise. What does it mean to be surprised? Usually, it means that something happened that was not expected to happen. So, surprise it a probabilistic concept and can be explicated as such (I J Good has written about that). See also Wikipedia and Bayesian Surprise.

Take the particular case of a yes/no situation, something can happen or not. It happens with probability $p$. Say, if p=0.9 and it happens, you are not really surprised. If $p=0.05$ and it happens, you are somewhat surprised. And if $p=0.0000001$ and it happens, you are really surprised. So, a natural measure of "surprise value in observed outcome" is some (anti)monotone function of the probability of what happened. It seems natural (and works well ...) to take the logarithm of probability of what happened, and then we throw in a minus sign to get a positive number. Also, by taking the logarithm we concentrate on the order of the surprise, and, in practice, probabilities are often only known up to order, more or less.

So, we define $$ \text{Surprise}(A) = -\log p(A) $$ where $A$ is the observed outcome, and $p(A)$ is its probability.

Now we can ask what is the expected surprise. Let $X$ be a Bernoulli random variable with probability $p$. It has two possibly outcomes, 0 and 1. The respective surprise values is $$\begin{align} \text{Surprise}(0) &= -\log(1-p) \\ \text{Surprise}(1) &= -\log p \end{align} $$ so the surprise when observing $X$ is itself a random variable with expectation $$ p \cdot -\log p + (1-p) \cdot -\log(1-p) $$ and that is --- surprise! --- the entropy of $X$! So entropy is expected surprise!

Now, this question is about maximum entropy. Why would anybody want to use a maximum entropy distribution? Well, it must be because they want to be maximally surprised! Why would anybody want that?

A way to look at it is the following: You want to learn about something, and to that goal you set up some learning experiences (or experiments ...). If you already knew everything about this topic, you are able to always predict perfectly, so are never surprised. Then you never get new experience, so do not learn anything new (but you know everything already---there is nothing to learn, so that is OK). In the more typical situation that you are confused, not able to predict perfectly, there is a learning opportunity! This leads to the idea that we can measure the "amount of possible learning" by the expected surprise, that is, entropy. So, maximizing entropy is nothing other than maximizing opportunity for learning. That sounds like a useful concept, which could be useful in designing experiments and such things.

A poetic example is the well known

Wenn einer eine reise macht, dann kann er was erzählen ...

One practical example: You want to design a system for online tests (online meaning that not everybody gets the same questions, the questions are chosen dynamically depending on previous answers, so optimized, in some way, for each person).

If you make too difficult questions, so they are never mastered, you learn nothing. That indicates you must lower the difficulty level. What is the optimal difficulty level, that is, the difficulty level which maximizes the rate of learning? Let the probability of correct answer be $p$. We want the value of $p$ that maximizes the Bernoulli entropy. But that is $p=0.5$. So you aim to state questions where the probability of obtaining a correct answer (from that person) is 0.5.

Then the case of a continuous random variable $X$. How can we be surprised by observing $X$? The probability of any particular outcome $\{X=x\}$ is zero, the $-\log p$ definition is useless. But we will be surprised if the probability of observing something like $x$ is small, that is, if the density function value $f(x)$ is small (assuming $f$ is continuous). That leads to the definition $$ \DeclareMathOperator{\E}{\mathbb{E}} \text{Surprise}(x) = -\log f(x) $$ With that definition, the expected surprise from observing $X$ is $$ \E \{-\log f(X)\} = -\int f(x) \log f(x) \; dx $$ that is, the expected surprise from observing $X$ is the differential entropy of $X$. It can also be seen as the expected negative loglikelihood.

But this isn't really the same as the first, event, case. Too see that, an example. Let the random variable $X$ represent the length of a throw of a stone (say in a sports competition). To measure that length we need to choose a length unit, since there is no intrinsic scale to length, as there is to probability. We could measure in mm or in km, or more usually, in meters. But our definition of surprise, hence expected surprise, depends on the unit chosen, so there is no invariance. For that reason, the values of differential entropy are not directly comparable the way that Shannon entropy is. It might still be useful, if one remembers this problem.

Best Answer

Related Solutions

Outlier Detection – How to Detect Outliers in a Mixture of Gaussians Using Normal Distribution Models

Solved – Statistical interpretation of Maximum Entropy Distribution

Related Question