Jensen-Shannon divergence for finite samples

binning, distance, entropy, finite-population, information theory

I have two finite samples $s_1$ and $s_2$ and two distributions $p_1(s_1)$ and $p_2(s_2)$ associated with these samples. I am essentially interested in measuring the distance or similarity between these two distributions. I am currently using the Jensen-Shannon (JS) distance, which involves calculating the entropies of the two distributions separately and jointly.
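(For concreteness, I am using the standard entropy form of the JS divergence,

$$\mathrm{JS}(p_1, p_2) \;=\; H\!\left(\frac{p_1 + p_2}{2}\right) \;-\; \frac{H(p_1) + H(p_2)}{2},$$

where $H$ denotes Shannon entropy, and taking the square root of this quantity as the JS distance.)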
I need to bin the data to compute these entropies, and it turns out that the resulting JS distance is also a function of the binning strategy I use. This seems quite arbitrary, and I don't know how much I can trust the results.
Is there any way around this? How can I measure the distance between $p_1(s_1)$ and $p_2(s_2)$ more rigorously, without data binning or any other such arbitrary choice?

Best Answer

If your data come from a setting where it is realistic to treat the measurements as "noisy", it may help to pick a prior noise model for your data, so that both of your distributions have support over the same space.

A simple approach would be to consider that your distributions are Gaussian mixtures with means at the observed data points and a single common variance. This "smoothing" gives each distribution support on the whole real line, for 1-dimensional data. It then remains to calculate the Kullback-Leibler divergences between the two Gaussian mixtures: there is no closed-form solution for this, but you can compute it numerically using the approaches in Hershey and Olsen (1), e.g. by Monte Carlo sampling. For higher-dimensional data you could use multivariate Gaussian mixtures (with full-rank covariance matrices, for the reason given by Matus Telgarsky in the MathOverflow discussion (2)).
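Here is a minimal sketch of that idea in Python for 1-dimensional data; the function names and the particular Monte Carlo scheme are my own, not taken from the paper. It estimates $\mathrm{JS}(p_1, p_2) = \tfrac{1}{2}\mathrm{KL}(p_1\|m) + \tfrac{1}{2}\mathrm{KL}(p_2\|m)$, with $m = (p_1 + p_2)/2$, by drawing samples from each smoothed (equal-weight Gaussian mixture) distribution:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm


def mixture_logpdf(x, centres, sigma):
    """Log-density of an equal-weight Gaussian mixture with components
    centred at `centres` and a common standard deviation `sigma`."""
    # Broadcast to shape (len(x), len(centres)), then sum components in log space.
    log_components = norm.logpdf(x[:, None], loc=centres[None, :], scale=sigma)
    return logsumexp(log_components, axis=1) - np.log(len(centres))


def js_divergence_mc(s1, s2, sigma, n_mc=100_000, rng=None):
    """Monte Carlo estimate of the Jensen-Shannon divergence between the
    smoothed distributions p1 (centred on s1) and p2 (centred on s2)."""
    rng = np.random.default_rng(rng)
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)

    def sample(centres, n):
        # Draw from the mixture: pick a component uniformly, then add Gaussian noise.
        idx = rng.integers(len(centres), size=n)
        return centres[idx] + sigma * rng.standard_normal(n)

    def kl_to_mixture(x, centres):
        # Estimate KL(p || m), with m = (p1 + p2) / 2, from samples x ~ p.
        log_p = mixture_logpdf(x, centres, sigma)
        log_m = np.logaddexp(mixture_logpdf(x, s1, sigma),
                             mixture_logpdf(x, s2, sigma)) - np.log(2)
        return np.mean(log_p - log_m)

    # JS(p1, p2) = 0.5 * KL(p1 || m) + 0.5 * KL(p2 || m)
    return 0.5 * kl_to_mixture(sample(s1, n_mc), s1) + \
           0.5 * kl_to_mixture(sample(s2, n_mc), s2)
```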

Of course, this has not removed all arbitrariness: we still have the common prior variance (and the choice of noise model). But it should be easy to study numerically how the Jensen-Shannon divergence behaves as you vary the single parameter of the noise model.
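For example, a quick sensitivity check using the sketch above (synthetic data and illustrative values of $\sigma$ only):

```python
rng = np.random.default_rng(0)
s1 = rng.normal(0.0, 1.0, size=200)   # toy sample 1
s2 = rng.normal(0.5, 1.2, size=150)   # toy sample 2

# See how the estimated JS divergence varies with the smoothing parameter.
for sigma in (0.05, 0.1, 0.2, 0.5, 1.0):
    print(f"sigma={sigma:.2f}  JS ~ {js_divergence_mc(s1, s2, sigma, rng=1):.4f}")
```

If the estimate is stable over a reasonable range of $\sigma$, that is some reassurance that your conclusions do not hinge on the smoothing choice.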
