Solved – Jensen-Shannon divergence for bivariate normal distributions

distance-functions, information theory, normal distribution

Given two bivariate normal distributions $P \equiv \mathcal{N}(\mu_p, \Sigma_p)$ and $Q \equiv \mathcal{N}(\mu_q, \Sigma_q)$, I am trying to calculate the Jensen-Shannon divergence between them, defined (for the discrete case) as:
$JSD(P\|Q) = \frac{1}{2} (KLD(P\|M)+ KLD(Q\|M))$
where $KLD$ is the Kullback-Leibler divergence, and $M=\frac{1}{2}(P+Q)$
I've found how to calculate $KLD$ in terms of the distributions' parameters, and thus $JSD$.
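For concreteness, here is a rough sketch of the closed-form $KLD$ between two multivariate normals that I'm using (NumPy; the function and variable names are just mine, and the result is in nats):

```python
import numpy as np

def kl_mvn(mu_p, Sigma_p, mu_q, Sigma_q):
    """KL(P || Q) in nats for P = N(mu_p, Sigma_p) and Q = N(mu_q, Sigma_q)."""
    k = mu_p.shape[0]
    Sigma_q_inv = np.linalg.inv(Sigma_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(Sigma_q_inv @ Sigma_p)
                  + diff @ Sigma_q_inv @ diff
                  - k
                  + np.log(np.linalg.det(Sigma_q) / np.linalg.det(Sigma_p)))
```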

My doubts are:

  1. To calculate $M$, I just did $M \equiv \mathcal{N}(\frac{1}{2}(\mu_p + \mu_q), \frac{1}{2}(\Sigma_p + \Sigma_q))$. Is this right?

  2. I've read in [1] that the $JSD$ is bounded, but that doesn't appear to hold when I calculate it as described above for normal distributions. Does this mean I am calculating it wrong, violating an assumption, or is there something else I don't understand?

Best Answer

The midpoint measure $\newcommand{\bx}{\mathbf{x}} \newcommand{\KL}{\mathrm{KL}}M$ is a mixture distribution of the two multivariate normals, so it does not have the form that you give in the original post. Let $\varphi_p(\bx)$ be the probability density function of a $\mathcal{N}(\mu_p, \Sigma_p)$ random vector and $\varphi_q(\bx)$ be the pdf of $\mathcal{N}(\mu_q, \Sigma_q)$. Then the pdf of the midpoint measure is $$ \varphi_m(\bx) = \frac{1}{2} \varphi_p(\bx) + \frac{1}{2} \varphi_q(\bx) \> . $$
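In code, the midpoint density is just the pointwise average of the two component densities. A minimal sketch, assuming SciPy's `multivariate_normal` (the function name is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def midpoint_pdf(x, mu_p, Sigma_p, mu_q, Sigma_q):
    """Density of the midpoint (mixture) measure M = (P + Q)/2 at x."""
    return 0.5 * multivariate_normal.pdf(x, mean=mu_p, cov=Sigma_p) \
         + 0.5 * multivariate_normal.pdf(x, mean=mu_q, cov=Sigma_q)
```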

The Jensen-Shannon divergence is $$ \mathrm{JSD} = \frac{1}{2} (\KL(P\|M) + \KL(Q\|M)) = h(M) - \frac{1}{2} (h(P) + h(Q)) \>, $$ where $h(P)$ denotes the (differential) entropy corresponding to the measure $P$.

Thus, your calculation reduces to calculating differential entropies. For the multivariate normal $\mathcal{N}(\mu, \Sigma)$, the answer is well-known to be $$ \frac{1}{2} \log_2\big((2\pi e)^n |\Sigma|\big) $$ and the proof can be found in any number of sources, e.g., Cover and Thomas (1991), pp. 230-231. It is worth pointing out that the entropy of a multivariate normal is invariant with respect to the mean, as the expression above shows. However, this almost assuredly does not carry over to the case of a mixture of normals. (Think about picking one broad normal centered at zero and another concentrated normal where the latter is pushed out far away from the origin.)
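As a sketch of that expression (in bits, matching the $\log_2$ above; the function name is mine):

```python
import numpy as np

def mvn_entropy_bits(Sigma):
    """Differential entropy (bits) of N(mu, Sigma); note it does not depend on mu."""
    n = Sigma.shape[0]
    return 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(Sigma))
```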

For the midpoint measure, things appear to be more complicated. As far as I know, there is no closed-form expression for the differential entropy $h(M)$. Searching on Google yields a couple of potential hits, but the top ones don't appear to give closed forms in the general case. You may be stuck with approximating this quantity in some way.
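One simple way to approximate it (a sketch, not part of the original answer) is plain Monte Carlo: draw samples from the mixture, average $-\log_2 \varphi_m$, and combine the result with the closed-form entropies of $P$ and $Q$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def jsd_mc(mu_p, Sigma_p, mu_q, Sigma_q, n_samples=100_000, seed=0):
    """Monte Carlo estimate of JSD(P || Q) in bits for two (bi)variate normals."""
    rng = np.random.default_rng(seed)
    P = multivariate_normal(mean=mu_p, cov=Sigma_p)
    Q = multivariate_normal(mean=mu_q, cov=Sigma_q)

    # Sample from the mixture M: pick P or Q with probability 1/2 for each draw.
    from_p = rng.random(n_samples) < 0.5
    x = np.where(from_p[:, None],
                 P.rvs(n_samples, random_state=rng),
                 Q.rvs(n_samples, random_state=rng))

    # h(M) is approximated by E_M[-log2 phi_m(X)] over the mixture samples.
    h_m = -np.mean(np.log2(0.5 * P.pdf(x) + 0.5 * Q.pdf(x)))

    # Closed-form differential entropies of P and Q, in bits.
    n = len(mu_p)
    h_p = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(Sigma_p))
    h_q = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(Sigma_q))

    return h_m - 0.5 * (h_p + h_q)
```

Measured in bits, the estimate should stay below 1, which gives a quick sanity check against the boundedness result discussed below.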

Note also that the paper you reference does not restrict its treatment to discrete distributions; it handles a case general enough that your problem falls within its framework. See the middle of the second column on page 1859, where it is also shown that the divergence is bounded. That result holds for two general measures and is not restricted to two discrete distributions.

The Jensen-Shannon Divergence has come up a couple of times recently in other questions on this site. See here and here.


Addendum: Note that a mixture of normals is not the same as a linear combination of normals. The simplest way to see this is to consider the one-dimensional case. Let $X_1 \sim \mathcal{N}(-\mu, 1)$ and $X_2 \sim \mathcal{N}(\mu, 1)$ and let them be independent of one another. Then a mixture of the two normals using weights $(\alpha, 1-\alpha)$ for $\alpha \in (0,1)$ has the distribution $$ \varphi_m(x) = \alpha \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{(x+\mu)^2}{2}} + (1-\alpha) \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2}} \> . $$

The distribution of a linear combination of $X_1$ and $X_2$ using the same weights as before is, via the stability of the normal distribution under linear combinations, $$ \varphi_{\ell}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-(1-2\alpha)\mu)^2}{2\sigma^2}} \>, $$ where $\sigma^2 = \alpha^2 + (1-\alpha)^2$.

These two distributions are very different, though they have the same mean. This is not an accident and follows from linearity of expectation.
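A quick simulation (again a sketch, not part of the original answer) makes the contrast concrete: both constructions have mean $(1-2\alpha)\mu$, but their spreads are very different.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, alpha = 100_000, 3.0, 0.3

x1 = rng.normal(-mu, 1.0, n)   # X1 ~ N(-mu, 1)
x2 = rng.normal(mu, 1.0, n)    # X2 ~ N(mu, 1), independent of X1

# Mixture: keep a realization of X1 with probability alpha, of X2 otherwise.
pick = rng.random(n) < alpha
mixture = np.where(pick, x1, x2)

# Linear combination: alpha * X1 + (1 - alpha) * X2.
linear = alpha * x1 + (1 - alpha) * x2

print(mixture.mean(), linear.mean())  # both close to (1 - 2*alpha) * mu = 1.2
print(mixture.var(), linear.var())    # roughly 8.6 versus roughly 0.58
```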

To understand the mixture distribution, imagine that you had to go to a statistical consultant so that she could produce values from this distribution for you. She holds one realization of $X_1$ in one palm and one realization of $X_2$ in the other palm (though you don't know which of the two palms each is in). Now, her assistant flips a biased coin with probability $\alpha$ out of sight of you and then comes and whispers the result into the statistician's ear. She opens one of her palms and shows you the realization, but doesn't tell you the outcome of the coin flip. This process produces the mixture distribution.

On the other hand, the linear combination can be understood in the same context. The statistical consultant simply takes both realizations, multiplies the first by $\alpha$ and the second by $(1-\alpha)$, adds them together, and shows you the sum.