Solved – KL Divergence, Bregman, and uniqueness

distance, exponential-family, information-geometry, kullback-leibler

I was reading the following paper on Bregman divergences (link):

Banerjee, Arindam, et al. "Clustering with Bregman divergences." Journal of machine learning research 6.Oct (2005): 1705-1749.

In Section 4 (p. 1720), the authors mention:

It has been observed in the literature that exponential families and
Bregman divergences have a close relationship that can be exploited
for several learning problems. In particular, Forster and Warmuth
(2000)[Section 5.1] remarked that the log-likelihood of the density of
an exponential family distribution $p_{(\psi,\theta)}$ can be written as the sum of
the negative of a uniquely determined Bregman divergence $d_\phi(x,\mu)$ and a
function that does not depend on the distribution parameters.

They later prove in Theorem 4 (p. 1721) that this correspondence is unique and one-to-one.
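Concretely, as I read Theorem 4 (please check the exact statement in the paper), the density factors as

$$
p_{(\psi,\theta)}(x) = \exp\!\big(-d_\phi(x,\mu)\big)\, b_\phi(x), \qquad \mu = \nabla\psi(\theta),
$$

where $\phi$ is the Legendre conjugate of the cumulant function $\psi$, so the log-likelihood is $-d_\phi(x,\mu)$ plus the term $\log b_\phi(x)$, which does not depend on the parameters.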

"From Theorem 4 we note that every regular exponential family
corresponds to a unique and distinct Bregman divergence (one-to-one
mapping)" (pg 1722)

Table 2 lists some distributions within the exponential family along with their unique Bregman divergences. A selected portion of that table is copied below (names of divergences taken from Table 1):

$$
\begin{array}{l|l|l}
\text{Distribution} & d_\phi(x,\mu) & \text{Name} \\
\hline
\text{1-D Gaussian} & \frac{1}{2 \sigma^2} {(x-\mu)}^2 & \text{Squared Loss}\\
\text{1-D Poisson} & x \log \left( \frac{x}{\mu} \right) - (x-\mu) & \\
\text{1-D Bernoulli} & x \log \left( \frac{x}{\mu} \right) + (1-x) \log \left( \frac{1-x}{1-\mu} \right) & \text{Logistic Loss}\\
\text{1-D Binomial} & x \log \left( \frac{x}{\mu} \right) + (N-x) \log \left( \frac{N-x}{N-\mu} \right) & \\
\text{1-D Exponential} & \frac{x}{\mu} - \log \left( \frac{x}{\mu} \right) - 1 & \text{Itakura-Saito distance} \\
\text{d-D Sph. Gaussian} & \frac{1}{2 \sigma^2} {\|x-\mu\|}^2 & \\
\text{d-D Multinomial} & \sum_{j=1}^d x_j \log \left( \frac{x_j}{\mu_j} \right) & \text{KL-divergence}\\
\end{array}
$$
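To make the table concrete, here is a minimal sketch (my own, not from the paper; the helper name is made up) that evaluates a few of these divergences directly from the generic Bregman form $d_\phi(x,\mu) = \phi(x) - \phi(\mu) - \langle \nabla\phi(\mu),\, x - \mu \rangle$:

```python
import numpy as np

def bregman(phi, grad_phi, x, mu):
    """Generic Bregman divergence d_phi(x, mu) = phi(x) - phi(mu) - <grad phi(mu), x - mu>."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    return phi(x) - phi(mu) - np.sum(grad_phi(mu) * (x - mu))

# Squared loss (1-D Gaussian, sigma = 1): phi(t) = t^2 / 2  ->  d = (x - mu)^2 / 2
sq = bregman(lambda t: 0.5 * t**2, lambda t: t, 3.0, 1.0)                 # 2.0

# Itakura-Saito (1-D Exponential): phi(t) = -log t  ->  d = x/mu - log(x/mu) - 1
it = bregman(lambda t: -np.log(t), lambda t: -1.0 / t, 3.0, 1.0)          # 2 - log 3

# KL divergence (Multinomial): phi(t) = sum_j t_j log t_j  ->  d = sum_j x_j log(x_j / mu_j)
x  = np.array([0.2, 0.3, 0.5])
mu = np.array([0.25, 0.25, 0.5])
kl = bregman(lambda t: np.sum(t * np.log(t)), lambda t: np.log(t) + 1.0, x, mu)

print(sq, it, kl)
```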

Questions:

  1. If every exponential family distribution has a unique Bregman divergence, is that the optimal distance (divergence) measure to use for that specific distribution? (e.g. use Logistic Loss for Bernoulli)

  2. If yes to #1 above, why is KL-divergence used so often to compare two distributions when it is unique only to multinomials? (even for comparisons within the exponential family)

For example, Wikipedia lists the KL divergence between two members of the same distribution, even within the exponential family:

  • d-D Gaussian link
  • 1-D Poisson link
  • 1-D Exponential link
  • (ironically, KL is not included in the multinomial article)

    1. Is there a theoretical justification for using KL-divergence between those distributions even though they may have a different Bregman divergence, and KL is unique to just multinomials? It seems that if #1 is true, then the optimal divergence for the Exponential distribution would be the Itakura-Saito distance, etc.

    2. If #1 is false, then when is it proper to use the Bregman divergence of that distribution compared to KL-divergence (or others), which may be used to compare two members of the same distribution? Does KL have a stronger theoretical justification for use across distributions, even though it is a special case of the Bregman divergence unique to multinomials?

Best Answer

1. If every exponential family distribution has a unique Bregman divergence, is that the optimal distance (divergence) measure to use for that specific distribution? (e.g. use Logistic Loss for Bernoulli)

It depends on how important duality is to you. See [Amari]; in fact, the $\alpha$-divergence is almost always superior to the Bregman divergence.
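For reference (my addition, not part of the original answer), one common convention for Amari's $\alpha$-divergence is

$$
D_\alpha(p \,\|\, q) = \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{\frac{1-\alpha}{2}}\, q(x)^{\frac{1+\alpha}{2}}\, dx\right), \qquad \alpha \neq \pm 1,
$$

which recovers the two directions of the KL divergence in the limits $\alpha \to \pm 1$ (which limit gives which direction depends on the sign convention used).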

2. If yes to #1 above, why is KL-divergence used so often to compare two distributions when it is unique only to multinomials? (even for comparisons within the exponential family)

I am not sure what you are asking here. But [Amari] also notes that it corresponds to the $L^2$ distance, which is so commonly used in statistics. It is generally very wrong to use a divergence to specify two different distributions; even a distance might not be a good choice for specifying distributions. Many machine learning models are theoretically sound only in a loose sense; for example, the popular Wasserstein-GAN model determines a probability distribution only up to a diffeomorphism.

3. Is there a theoretical justification for using KL-divergence between those distributions even though they may have a different Bregman divergence, and KL is unique to just multinomials? It seems that if #1 is true, then the optimal divergence for the Exponential distribution would be the Itakura-Saito distance, etc.

Optimal in what sense? It is only a distance that might distinguish different members ...

4. If #1 is false, then when is it proper to use the Bregman divergence of that distribution compared to KL-divergence (or others), which may be used to compare two members of the same distribution? Does KL have a stronger theoretical justification for use across distributions, even though it is a special case of the Bregman divergence unique to multinomials?

Another possible reason is that the multinomial is a primitive model for density estimation (realistic empirical models are always finitely supported), so the KL divergence can be useful when we are only comparing empirical models.
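As a small illustration of that last point (my own sketch, not from the answer): once both models are reduced to empirical histograms on a shared finite support, i.e. multinomials, the KL divergence between them is exactly the multinomial Bregman divergence from the table above.

```python
import numpy as np

def empirical_kl(samples_p, samples_q, bins):
    """KL(p || q) between two empirical histograms on a shared, finite support.
    A tiny epsilon keeps empty bins from producing log(0)."""
    eps = 1e-12
    p, _ = np.histogram(samples_p, bins=bins)
    q, _ = np.histogram(samples_q, bins=bins)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
bins = np.linspace(0.0, 10.0, 21)                  # shared finite support
a = rng.exponential(scale=1.0, size=10_000)
b = rng.exponential(scale=1.5, size=10_000)
print(empirical_kl(a, b, bins))                    # small positive number; 0 iff the histograms coincide
```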

[Amari] Amari, Shun-ichi. "Divergence function, information monotonicity and information geometry." Workshop on Information Theoretic Methods in Science and Engineering (WITMSE). 2009.
