Intuition
The Kullback-Leibler divergence can be interpreted as
how many bits of information we expect to lose if we use $Q$ instead of $P$.
Thus the Population Stability Index is the "round-trip loss":
how many bits of information we expect to lose if we use $Q$ instead of $P$, and then, on the way back, $P$ instead of $Q$.
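In formulas, for binned distributions $P=(p_1,\dots,p_n)$ and $Q=(q_1,\dots,q_n)$, this round trip is the symmetrised KL divergence:
$$
\mathrm{PSI}(P,Q) = \sum_i (p_i - q_i)\,\ln\frac{p_i}{q_i} = \mathrm{KL}(P\,\|\,Q) + \mathrm{KL}(Q\,\|\,P).
$$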
Values
It appears that the Population Stability Index is closely related to the G-test:
$$
\mathrm{PSI}(P,Q) = \frac{G(P,Q) + G(Q,P)}{2N}
$$
(and thus can be computed using scipy.stats.power_divergence, as well as directly).
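As a minimal sketch of that relation (the bin probabilities p and q below are made up purely for illustration):

import numpy as np
from scipy.stats import power_divergence

N = 1000                              # common sample size for both samples
p = np.array([0.30, 0.40, 0.30])      # reference distribution (binned)
q = np.array([0.25, 0.45, 0.30])      # new distribution (binned)

# Direct definition: PSI = sum_i (p_i - q_i) * ln(p_i / q_i)
psi_direct = np.sum((p - q) * np.log(p / q))

# Via the G-test: lambda_="log-likelihood" gives G = 2 * sum(obs * ln(obs / exp))
g_pq, _ = power_divergence(f_obs=N * p, f_exp=N * q, lambda_="log-likelihood")
g_qp, _ = power_divergence(f_obs=N * q, f_exp=N * p, lambda_="log-likelihood")
psi_via_g = (g_pq + g_qp) / (2 * N)

print(psi_direct, psi_via_g)          # the two numbers agree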
Therefore the p-values corresponding to PSI can be computed using the $\chi^2$ distribution:
import scipy.stats as st

# p-value for a given PSI: survival function of the chi-squared distribution
print("        ", " ".join("DF=%d" % df for df in [1, 2, 3]))
for psi in [0.1, 0.25]:
    print("PSI=%.2f %s" % (psi, "".join(
        " %5f" % st.distributions.chi2.sf(psi, df) for df in [1, 2, 3])))
DF=1 DF=2 DF=3
PSI=0.10 0.751830 0.951229 0.991837
PSI=0.25 0.617075 0.882497 0.969140
Here PSI is the Population Stability Index and DF is the number of degrees of freedom ($\mathrm{DF}=n-1$, where $n$ is the number of distinct values that the variable takes).
Interestingly enough, the official "interpretation" of the PSI value completely ignores DF.
Dimensionality reduction techniques are often motivated by finding new representations of the data that reveal hidden variables or structure. SNE takes a different approach (compared to PCA, for example) by preserving local structure, which it does by taking advantage of the asymmetry of the KL divergence.
Conditional probabilities as inverse distance
Looking at Eq (1), notice that the conditional probability can be interpreted as an "inverse distance": close points (low distance) are assigned high probabilities, and far points (high distance) are assigned low probabilities.
(Note: the name "inverse distance" is not true in a strict mathematical sense, because a larger set of numbers, $\mathbb{R}$, is mapped onto a smaller set, $[0,1]$.)
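As a rough illustration (assuming Eq (1) is the usual SNE conditional probability $p_{j|i} \propto \exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)$, and using a single fixed $\sigma$ for simplicity):

import numpy as np

def conditional_probabilities(X, sigma=1.0):
    # Pairwise squared Euclidean distances between rows of X
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)          # convention: p_{i|i} = 0
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)    # normalise each row over j

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
# Close points get high conditional probability, far points get low probability
print(conditional_probabilities(X))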
Taking advantage of asymmetry in KL
Two scenarios behave differently from what a symmetric cost function in Equation (2) would give (a small numeric check follows the list):
- $ p_{i|j} \gg q_{i|j}$: points that are close in the high-dimensional space but far apart in the low-dimensional space are penalised heavily. This is important, because it promotes the preservation of local structure.
- $ q_{i|j} \gg p_{i|j}$: points that are far apart in the high-dimensional space but close in the low-dimensional space are penalised only lightly. This is acceptable for us.
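A tiny numeric check of the two scenarios above (the probabilities are made up; only the orders of magnitude matter):

import numpy as np

p_close, q_far = 0.4, 0.01     # close in high-D space, far in the embedding
p_far, q_close = 0.01, 0.4     # far in high-D space, close in the embedding

# Per-pair contribution to the KL cost: p * ln(p / q)
print(p_close * np.log(p_close / q_far))    # ~ 1.48: heavy penalty
print(p_far * np.log(p_far / q_close))      # ~ -0.04: almost no penalty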
Thus the asymmetry of the KL divergence, together with the definition of the conditional probability, constitutes the key idea of this dimensionality reduction technique. Below you can see that this is exactly why the other distances fail to be good substitutes.
So then, what is the problem with the other distance metrics?
The Jensen-Shannon divergence is effectively a symmetrisation of the KL divergence:
$$ \mathrm{JSD}(P_i \| Q_i) = \frac{1}{2}\mathrm{KL}(P_i \| M_i) + \frac{1}{2}\mathrm{KL}(Q_i \| M_i), \qquad M_i = \frac{1}{2}(P_i + Q_i). $$
Being symmetric, it loses exactly the property of preserving local structures, so it is not a good substitute.
The Wasserstein distance can intuitively be seen as the cost of rearranging one histogram into another. The rearrangement costs the same in both directions, so the Wasserstein metric is also symmetric and therefore also lacks this desirable property.
The Kolmogorov-Smirnov distance is nonparametric, i.e. it does not assume any particular probability distribution, whereas the structure we want to preserve is explicitly described by the distribution in Eq (1).
Best Answer
No, the square root of the symmetrised KL divergence is not a metric. A counterexample is as follows:
However, for $P$ and $Q$ very close together, $D(P, Q)$, $J(P, Q)$ and $S(P, Q)$ are essentially the same (they are proportional to one another up to $O((P-Q)^3)$ terms), and their square roots are metrics to the same order. We can take this local metric and integrate it over the whole space of probability distributions to obtain a global metric. The result is:
$$A(P, Q) = \cos^{-1}\left(\sum_x \sqrt{P(x)Q(x)} \right)$$
I worked this out myself, so I'm afraid I do not know what it is called. I will use A for Alistair until I find out. ;-)
By construction, the triangle inequality in this metric is tight. You can actually find a unique shortest path through the space of probability distributions from $P$ to $Q$ that has the right length. In that respect it is preferable to the otherwise similar Hellinger distance:
$$H(P, Q) = \sqrt{1 - \sum_x \sqrt{P(x)\,Q(x)}}$$
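A small sketch (not part of the original answer) that computes both quantities for discrete distributions and spot-checks the triangle inequality for $A$:

import numpy as np

def bhattacharyya_angle(p, q):
    # A(P, Q) = arccos( sum_x sqrt(P(x) Q(x)) ); clip guards against round-off
    return np.arccos(np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0))

def hellinger(p, q):
    # H(P, Q) = sqrt(1 - sum_x sqrt(P(x) Q(x)))
    return np.sqrt(1.0 - np.sum(np.sqrt(p * q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
r = np.array([0.3, 0.4, 0.3])

# Triangle inequality for A, checked on one triple of distributions
print(bhattacharyya_angle(p, q) <= bhattacharyya_angle(p, r) + bhattacharyya_angle(r, q))
print(bhattacharyya_angle(p, q), hellinger(p, q))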
Update 2013-12-05: Apparently this is called the Bhattacharyya arc-cos distance.