The intuition behind the Population Stability Index

Tags: distributions, intuition, kullback-leibler, likelihood

The "Population Stability Index" for two distributions $P$ and $Q$ is defined as the Symmetrised Kullback-Leibler divergence:

$$
\mathrm{PSI}(P,Q) = D_{KL}(P||Q) + D_{KL}(Q||P) = \sum_i(P_i-Q_i)\log\frac{P_i}{Q_i}
$$
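The identity in the definition can be checked numerically. The following is a minimal sketch, assuming both distributions are given as bin proportions over the same bins with every proportion strictly positive (otherwise the logarithm is undefined); the helper names `kl_divergence` and `psi` and the example proportions are illustrative and not part of the question.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def psi(p, q):
    """Population Stability Index: sum_i (p_i - q_i) * log(p_i / q_i)."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical bin proportions, chosen only for illustration.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

psi_direct = psi(p, q)                                   # right-hand side of the formula
psi_from_kl = kl_divergence(p, q) + kl_divergence(q, p)  # symmetrised KL divergence
print(psi_direct, psi_from_kl)
```

Both printed numbers should agree up to floating-point error (about 0.0511 for these proportions), and swapping the arguments of `psi` leaves the value unchanged, since the index is symmetric.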

What is the intuition behind this number?

One can always use the intuition for $D_{KL}$ and say that PSI is

the expected number of extra bits required to code samples from $P$ using a code optimized for $Q$ rather than the code optimized for $P$

plus the expected number of extra bits required to code samples from $Q$ using a code optimized for $P$ rather than the code optimized for $Q$,

but this is quite a mouthful.
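To make the coding interpretation concrete, here is a small sketch that computes the "extra bits" as cross-entropy minus entropy, using base-2 logarithms so that the unit really is bits. The function names and the two distributions are assumptions made for this example; the probabilities are powers of 1/2 so the arithmetic is exact.

```python
import math

def entropy_bits(p):
    """Expected code length (bits) for samples from p with a code optimized for p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Expected code length (bits) for samples from p with a code optimized for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions, chosen only for illustration.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

extra_p_with_q = cross_entropy_bits(p, q) - entropy_bits(p)  # D_KL(P||Q) in bits
extra_q_with_p = cross_entropy_bits(q, p) - entropy_bits(q)  # D_KL(Q||P) in bits
psi_bits = extra_p_with_q + extra_q_with_p
print(extra_p_with_q, extra_q_with_p, psi_bits)  # 0.25 0.25 0.5
```

Here each direction costs a quarter of a bit per sample, so the index measured in bits is 0.5. With natural logarithms the same quantity is expressed in nats, which is smaller by a factor of $\ln 2$.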

Quora and UCAnalytics offer this "interpretation":

  • PSI < 0.1: Insignificant change (No action required)
  • 0.1 < PSI < 0.25: Some minor change (Start worrying)
  • 0.25 < PSI: Major shift in population (Need to delve deeper)

What is the basis for these thresholds?

Best Answer

Intuition

Kullback-Leibler Divergence can be interpreted to mean

how many bits of information we expect to lose if we use $Q$ instead of $P$.

Thus the Population Stability Index is the "roundtrip loss":

how many bits of information we expect to lose if we use $Q$ instead of $P$ and then use $P$ instead of $Q$ to go back.

Values

It appears that the Population Stability Index is closely related to the G-test:

$$ \mathrm{PSI}(P,Q) = \frac{G(P,Q) + G(Q,P)}{2N} $$

(and thus can be computed using scipy.stats.power_divergence, as well as directly).
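The relation to the G-test can be verified on a small example. This is a sketch under stated assumptions: the bin counts are hypothetical, both samples have the same total $N$, and $G$ is taken in its usual form $G = 2\sum_i O_i \ln(O_i/E_i)$. The helper `g_statistic` is written out by hand for illustration; the optional cross-check uses `scipy.stats.power_divergence` with `lambda_="log-likelihood"` and runs only if SciPy is installed.

```python
import math

def g_statistic(observed, expected):
    """G = 2 * sum_i O_i * ln(O_i / E_i) for counts with equal totals."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

# Hypothetical bin counts; both samples contain N = 1000 observations.
counts_p = [500, 300, 200]
counts_q = [400, 400, 200]
n = sum(counts_p)
assert n == sum(counts_q)  # the identity below assumes equal totals

# PSI computed directly from the bin proportions.
p = [c / n for c in counts_p]
q = [c / n for c in counts_q]
psi_direct = sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# PSI recovered from the two G statistics.
psi_from_g = (g_statistic(counts_p, counts_q)
              + g_statistic(counts_q, counts_p)) / (2 * n)
print(psi_direct, psi_from_g)

# Optional cross-check against SciPy's G-test, if SciPy is available.
try:
    from scipy.stats import power_divergence
except ImportError:
    power_divergence = None

if power_divergence is not None:
    g_pq, _ = power_divergence(counts_p, f_exp=counts_q, lambda_="log-likelihood")
    g_qp, _ = power_divergence(counts_q, f_exp=counts_p, lambda_="log-likelihood")
    print((g_pq + g_qp) / (2 * n))
```

All printed values should agree up to floating-point error. The example also makes visible that each $G$ statistic grows linearly with $N$ while PSI does not depend on the sample size.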

Therefore the p-values corresponding to PSI can be computed using the $\chi^2$ distribution:

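import scipy.stats as st

dfs = [1, 2, 3]
# Header row: one column per number of degrees of freedom.
print("            ", "     ".join("DF=%d" % df for df in dfs))
for psi in [0.1, 0.25]:
    # Survival function (1 - CDF) of chi-squared, evaluated at the raw PSI
    # value, i.e. without any scaling by the sample size N.
    print("PSI=%.2f  %s" % (psi, "".join(
        " %5f" % st.chi2.sf(psi, df) for df in dfs)))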

               DF=1       DF=2       DF=3
PSI=0.10     0.751830   0.951229   0.991837
PSI=0.25     0.617075   0.882497   0.969140

Here PSI is the Population Stability Index and DF is the number of degrees of freedom ($\mathrm{DF}=n-1$ where $n$ is the number of distinct values that the variable takes).

Interestingly enough, the official "interpretation" of the PSI value completely ignores DF.