Solved – Trying to implement the Jensen-Shannon Divergence for Multivariate Gaussians

Tags: entropy, information theory, mutual information, normal distribution

Given two multivariate Gaussian distributions $P \equiv \mathcal{N}(\mu_p, \Sigma_p)$ and $Q \equiv \mathcal{N}(\mu_q, \Sigma_q)$, I am trying to calculate the Jensen-Shannon divergence between them.

I am following the discussion of the JSD for multivariate Gaussians in this question. There it is suggested that one can approximate the mixture (midpoint) measure $M$ using Monte Carlo sampling.

Specifically, it is pointed out that the JSD for continuous random variables (Gaussians, in my case) is given by

$$
\mathrm{JSD} = \frac{1}{2} \big(D_{KL}(P\,\|\,M) + D_{KL}(Q\,\|\,M)\big) = h(M) - \frac{1}{2} \big(h(P) + h(Q)\big) \,,
$$

where $h(P)$ and $h(Q)$ are the differential entropies of the multivariate normals. These have a well-known closed form and are easy to calculate, e.g.

$$
h(P) = \frac{1}{2} \log_2\big((2\pi e)^n |\Sigma_p|\big)
$$
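
As a quick illustration, here is a minimal sketch of that closed form in code (returning bits); the function name and the use of np.linalg.slogdet for numerical stability are my own choices:

import numpy as np

def mvn_entropy_bits(sigma_p: np.ndarray) -> float:
    """Differential entropy, in bits, of a multivariate Gaussian with covariance sigma_p."""
    n = sigma_p.shape[0]
    # slogdet gives the natural-log determinant; divide by log(2) to convert to base 2
    _, logdet = np.linalg.slogdet(sigma_p)
    return 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2))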

What is causing me trouble is $M$: the mixture of two Gaussians has no closed-form differential entropy, so it has to be approximated. I believe I have misunderstood or incorrectly implemented the Monte Carlo estimate for it.

User FrankD says that for the JSD approximation:
$$
\mathrm{JSD}(P\,\|\,Q) = \frac{1}{2} \big(D_{KL}(P\,\|\,M) + D_{KL}(Q\,\|\,M)\big)
$$
we can use Monte Carlo estimates for the individual components. The Kullback-Leibler divergence is defined as:
$$
D_{KL}(P\,\|\,M) = \int P(x) \log\!\left(\frac{P(x)}{M(x)}\right) dx
$$
The Monte Carlo approximation of this is:
$$
D_{KL}^{\mathrm{approx}}(P\,\|\,M) = \frac{1}{n} \sum^n_{i=1} \log\!\left(\frac{P(x_i)}{M(x_i)}\right)
$$

where the $x_i$ have been sampled from $P(x)$, which is easy since it is a Gaussian in our case. As $n \to \infty$, $D_{KL}^{\mathrm{approx}}(P\,\|\,M) \to D_{KL}(P\,\|\,M)$. $M(x_i)$ can be calculated as

$$
M(x_i) = \frac{1}{2}P(x_i) + \frac{1}{2}Q(x_i) \,.
$$
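
As a sanity check on this kind of estimator (my own illustration, not from the linked discussion), the same Monte Carlo average can be compared against the closed-form KL divergence between two Gaussians, where an exact answer exists; the example means and covariances below are made up:

import numpy as np
from scipy.stats import multivariate_normal as MVN

# Illustrative parameters for P and Q (assumed, for demonstration only)
mu_p, sigma_p = np.zeros(2), np.eye(2)
mu_q, sigma_q = np.ones(2), 2.0 * np.eye(2)

# Monte Carlo estimate of D_KL(P || Q) in nats: average log-density ratio over samples from P
x = MVN.rvs(mean=mu_p, cov=sigma_p, size=100_000)
kl_mc = np.mean(MVN.logpdf(x, mean=mu_p, cov=sigma_p) - MVN.logpdf(x, mean=mu_q, cov=sigma_q))

# Closed-form KL divergence between two multivariate Gaussians, also in nats
diff = mu_q - mu_p
inv_q = np.linalg.inv(sigma_q)
kl_exact = 0.5 * (np.trace(inv_q @ sigma_p) + diff @ inv_q @ diff
                  - len(mu_p) + np.log(np.linalg.det(sigma_q) / np.linalg.det(sigma_p)))

print(kl_mc, kl_exact)  # the two values should agree to roughly two decimal places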

Here is my attempt:

import numpy as np
from scipy.stats import multivariate_normal as MVN

def jsd(mu_1: np.ndarray, sigma_1: np.ndarray, mu_2: np.ndarray, sigma_2: np.ndarray):
    """
    Monte Carlo approximation to the Jensen-Shannon divergence (in bits)
    between two multivariate Gaussians.
    """
    assert mu_1.shape == mu_2.shape, "Shape mismatch."
    assert sigma_1.shape == sigma_2.shape, "Shape mismatch."

    # Number of Monte Carlo samples
    MC_samples = 1000

    # Draw samples from P and Q
    P_samples = MVN.rvs(mean=mu_1, cov=sigma_1, size=MC_samples)
    Q_samples = MVN.rvs(mean=mu_2, cov=sigma_2, size=MC_samples)

    # Densities of P, Q and the mixture M = (P + Q) / 2
    P = lambda x: MVN.pdf(x, mean=mu_1, cov=sigma_1)
    Q = lambda x: MVN.pdf(x, mean=mu_2, cov=sigma_2)
    M = lambda x: 0.5 * P(x) + 0.5 * Q(x)

    # Density ratios appearing inside the logarithms
    P_div_M = lambda x: P(x) / M(x)
    Q_div_M = lambda x: Q(x) / M(x)

    # Monte Carlo estimates of D_KL(P || M) and D_KL(Q || M), in bits
    D_KL_approx_PM = lambda x: np.mean(np.log2(P_div_M(x)))
    D_KL_approx_QM = lambda x: np.mean(np.log2(Q_div_M(x)))

    return 0.5 * D_KL_approx_PM(P_samples) + 0.5 * D_KL_approx_QM(Q_samples)

Suffice it to say, this does not quite produce what it should.

Best Answer

Actually, building on the answer in https://stackoverflow.com/questions/26079881/kl-divergence-of-two-gmms (and noting that its author factored the 1/2 out of the logarithm and had the Monte Carlo approximation sample from both distributions and average the results), I would say that the symmetrized numerical code for the Jensen-Shannon divergence using Monte Carlo integration, which works even for general scipy.stats distributions (distribution_p and distribution_q), should look like this:

import numpy as np
import scipy.stats as st


def distributions_js(distribution_p, distribution_q, n_samples=10 ** 5):
    # Jensen-Shannon divergence. (The Jensen-Shannon distance is the square root of the divergence.)
    # All logarithms are base 2, so the result is in bits (information entropy).
    X = distribution_p.rvs(n_samples)
    p_X = distribution_p.pdf(X)
    q_X = distribution_q.pdf(X)
    log_mix_X = np.log2(p_X + q_X)

    Y = distribution_q.rvs(n_samples)
    p_Y = distribution_p.pdf(Y)
    q_Y = distribution_q.pdf(Y)
    log_mix_Y = np.log2(p_Y + q_Y)

    # Average of D_KL(P || M) and D_KL(Q || M), where log2 M = log2(p + q) - log2(2)
    return (np.log2(p_X).mean() - (log_mix_X.mean() - np.log2(2))
            + np.log2(q_Y).mean() - (log_mix_Y.mean() - np.log2(2))) / 2

print("should be different:")
print(distributions_js(st.norm(loc=10000), st.norm(loc=0)))
print("should be same:")
print(distributions_js(st.norm(loc=0), st.norm(loc=0)))
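
With base-2 logarithms the Jensen-Shannon divergence is bounded above by 1 bit, so up to Monte Carlo noise the first value should come out close to 1 (the two normals barely overlap) and the second should be close to 0.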

For discrete distributions, change .pdf to .pmf (i.e. the probabilities of the samples).
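
For example, a minimal sketch of that discrete variant (my own adaptation of the function above, assuming frozen scipy.stats discrete distributions):

def distributions_js_discrete(distribution_p, distribution_q, n_samples=10 ** 5):
    # Same estimator as above, but using .pmf in place of .pdf
    X = distribution_p.rvs(n_samples)
    log_mix_X = np.log2(distribution_p.pmf(X) + distribution_q.pmf(X))

    Y = distribution_q.rvs(n_samples)
    log_mix_Y = np.log2(distribution_p.pmf(Y) + distribution_q.pmf(Y))

    return (np.log2(distribution_p.pmf(X)).mean() - (log_mix_X.mean() - np.log2(2))
            + np.log2(distribution_q.pmf(Y)).mean() - (log_mix_Y.mean() - np.log2(2))) / 2

print(distributions_js_discrete(st.poisson(3), st.poisson(30)))  # well separated, close to 1
print(distributions_js_discrete(st.poisson(3), st.poisson(3)))   # identical, close to 0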