No, the square root of the symmetrised KL divergence is not a metric. A counterexample is as follows:
- Let $P$ be a coin that produces a head 10% of the time.
- Let $Q$ be a coin that produces a head 20% of the time.
- Let $R$ be a coin that produces a head 30% of the time.
- Then $d(P, Q) + d(Q, R) = 0.284... + 0.232... < 0.519... = d(P, R)$.
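A quick numerical check of this counterexample (a minimal sketch of my own; the function name is mine, and I take the symmetrised divergence to be $KL(P\|Q) + KL(Q\|P)$, which is what the quoted numbers correspond to):
import numpy as np

def sqrt_sym_kl(p, q):
    # square root of the symmetrised KL divergence KL(P||Q) + KL(Q||P)
    return np.sqrt(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

P = np.array([0.1, 0.9])
Q = np.array([0.2, 0.8])
R = np.array([0.3, 0.7])

print(sqrt_sym_kl(P, Q), sqrt_sym_kl(Q, R), sqrt_sym_kl(P, R))    # cf. 0.284..., 0.232..., 0.519... above
print(sqrt_sym_kl(P, Q) + sqrt_sym_kl(Q, R) < sqrt_sym_kl(P, R))  # True: the triangle inequality fails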
However, for $P$ and $Q$ very close together, $D(P, Q)$, $J(P, Q)$ and $S(P, Q)$ are essentially the same (they are proportional to one another, up to $O((P-Q)^3)$ terms), and their square roots are metrics to the same order. We can take this local metric and integrate it up over the whole space of probability distributions to obtain a global metric. The result is:
$$A(P, Q) = \cos^{-1}\left(\sum_x \sqrt{P(x)Q(x)} \right)$$
I worked this out myself, so I'm afraid I do not know what it is called. I will use A for Alistair until I find out. ;-)
By construction, the triangle inequality in this metric is tight. You can actually find a unique shortest path through the space of probability distributions from $P$ to $Q$ that has the right length. In that respect it is preferable to the otherwise similar Hellinger distance:
$$H(P, Q) = \sqrt{1 - \sum_x \sqrt{P(x)Q(x)} }$$
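As a small illustration (a sketch of my own, using the formulas above and the three coins from the counterexample; the function names are mine): for these coins, which all lie on the same one-parameter family, the triangle inequality for $A$ is numerically an equality, while for $H$ it is strict.
import numpy as np

def bhattacharyya_angle(p, q):
    # A(P, Q) = arccos(sum_x sqrt(P(x) Q(x)))
    return np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))

def hellinger(p, q):
    # H(P, Q) = sqrt(1 - sum_x sqrt(P(x) Q(x)))
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q))))

P, Q, R = np.array([0.1, 0.9]), np.array([0.2, 0.8]), np.array([0.3, 0.7])

print(bhattacharyya_angle(P, Q) + bhattacharyya_angle(Q, R), bhattacharyya_angle(P, R))  # essentially equal: the coins lie on a geodesic
print(hellinger(P, Q) + hellinger(Q, R), hellinger(P, R))                                # strict inequality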
Update 2013-12-05: Apparently this is called the Bhattacharyya arc-cos distance.
Intuition
Kullback-Leibler divergence can be interpreted to mean
how many bits of information we expect to lose if we use $Q$ instead of $P$.
Thus the Population Stability Index is the "roundtrip loss":
how many bits of information we expect to lose if we use $Q$ instead of $P$, plus how many we expect to lose going back, using $P$ instead of $Q$.
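Written out with the usual binned definition of the PSI, this roundtrip reading is exactly the symmetrised KL divergence:
$$
\mathrm{PSI}(P,Q) = \sum_x \left(P(x) - Q(x)\right)\ln\frac{P(x)}{Q(x)} = D(P\|Q) + D(Q\|P).
$$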
Values
It appears that the Population Stability Index is closely related to the G-test:
$$
\mathrm{PSI}(P,Q) = \frac{G(P,Q) + G(Q,P)}{2N}
$$
(and thus can be computed using scipy.stats.power_divergence, as well as directly).
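Here is a sketch of that computation (my own example distributions and a hypothetical sample size $N$), checking the identity above by computing the PSI directly and via the G-test statistic from scipy.stats.power_divergence:
import numpy as np
from scipy.stats import power_divergence

P = np.array([0.1, 0.9])   # hypothetical "expected" distribution
Q = np.array([0.2, 0.8])   # hypothetical "actual" distribution
N = 1000                   # hypothetical sample size

# direct computation: PSI = sum (P - Q) * ln(P / Q)
psi_direct = np.sum((P - Q) * np.log(P / Q))

# G(P, Q) = 2 * sum(N*P * ln(P/Q)), the G-test statistic on counts N*P vs N*Q
g_pq = power_divergence(N * P, N * Q, lambda_="log-likelihood").statistic
g_qp = power_divergence(N * Q, N * P, lambda_="log-likelihood").statistic

print(psi_direct, (g_pq + g_qp) / (2 * N))  # the two values agree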
Therefore the p-values corresponding to PSI can be computed using the $\chi^2$ distribution:
import scipy.stats as st

# tail probability of the chi-squared distribution at the PSI value, for DF = 1, 2, 3
print(" ", " ".join("DF=%d" % df for df in [1, 2, 3]))
for psi in [0.1, 0.25]:
    print("PSI=%.2f %s" % (psi, "".join(
        " %5f" % st.distributions.chi2.sf(psi, df) for df in [1, 2, 3])))
DF=1 DF=2 DF=3
PSI=0.10 0.751830 0.951229 0.991837
PSI=0.25 0.617075 0.882497 0.969140
Here PSI is the Population Stability Index and DF is the number of degrees of freedom ($\mathrm{DF}=n-1$, where $n$ is the number of distinct values that the variable takes).
Interestingly enough, the official "interpretation" of the PSI value completely ignores DF.
Best Answer
Set aside Kullback-Leibler divergence for a moment and consider the following: it's perfectly possible for the Kolmogorov-Smirnov p-value to be small even though the corresponding Kolmogorov-Smirnov distance is small.
Specifically, that can easily happen with large sample sizes, where even small differences are still larger than we'd expect to see from random variation.
The same will naturally tend to happen when comparing some other suitable measure of divergence with the Kolmogorov-Smirnov p-value; it will quite naturally occur at large sample sizes.
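To illustrate (a simulation sketch of my own, with an arbitrary small mean shift and seed):
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

for n in [100, 10_000, 1_000_000]:
    x = rng.normal(0.00, 1.0, size=n)
    y = rng.normal(0.05, 1.0, size=n)   # a small, fixed difference between the two samples
    stat, p = ks_2samp(x, y)
    print("n=%9d  KS distance=%.4f  p-value=%.3g" % (n, stat, p))
# The KS distance stays small, but the p-value collapses as n grows.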
[If you don't wish to confound the distinction between Kolmogorov-Smirnov distance and p-value with the difference in what the two things are looking at, it might be better to explore the differences in the two measures ($D_{KS}$ and $D_{KL}$) directly, but that's not what is being asked here.]