The Bhattacharyya distance is defined as $D_B(p,q) = -\ln \left( BC(p,q) \right)$, where $BC(p,q) = \sum_{x\in X} \sqrt{p(x) q(x)}$ for discrete variables, and similarly (with an integral) for continuous random variables. I'm trying to gain some intuition as to what this metric tells you about the two probability distributions and when it might be a better choice than the KL divergence or the Wasserstein distance. (Note: I am aware that KL divergence is not a distance.)
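(For concreteness, here is a minimal R sketch of the discrete definition, with the KL divergence alongside for comparison; the vectors p and q are just arbitrary example distributions.)
p <- c(0.1, 0.2, 0.7)                 # arbitrary example distributions
q <- c(0.3, 0.4, 0.3)
BC <- sum(sqrt(p*q))                  # Bhattacharyya coefficient, in (0, 1]
DB <- -log(BC)                        # Bhattacharyya distance
KL <- sum(p*log(p/q))                 # KL divergence D(p || q), for comparison
c(BC, DB, KL)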
Solved – Intuition of the Bhattacharyya Coefficient and the Bhattacharyya distance
bhattacharyya, distance, distance-functions, intuition, mathematical-statistics
Related Solutions
1. If every exponential family distribution has a unique Bregman Divergence, then is that the optimal distance (divergence) metric to use specific to that distribution? (e.g. use Logistic Loss for Bernoulli)
It depends on whether duality is important to you. See [Amari]; in fact, the $\alpha$-divergence is almost always superior to the Bregman divergence.
2. If yes to #1 above, why is KL-divergence used so often comparing two distributions when it is unique only to multinomials? (comparisons even within the exponential family)
I am not sure what you are asking about. But [Amari] also noted that it corresponds to the $L^2$-distance, which is so commonly used in statistics. It is generally very wrong to use a divergence to distinguish two different distributions; even a distance might not be a good choice for specifying distributions. Many machine learning models are theoretically sound only in a limited sense; for example, the popular Wasserstein GAN model only determines a probability distribution up to a diffeomorphism.
3. Is there a theoretical justification for using KL-divergence between those distributions although they may have a different Bregman divergence, and KL is unique to just multinomials? It seems that if #1 is true, then the optimal divergence for the exponential distribution would be the Itakura-Saito distance, etc.
Optimal in what sense? It is only a distance that might identify different members ...
4. If #1 is false, then when is it proper to use the Bregman divergence associated with that distribution rather than the KL-divergence (or others) when comparing two members of the same distributional family? Does KL have a higher theoretical justification for use across distributions, although it is a special case of the Bregman divergence unique to multinomials?
Another possible reason is that the multinomial is a primitive model for density estimation (realistic empirical models are always finitely supported), so the KL divergence can be useful when we are only comparing empirical models.
[Amari] Amari, Shun-ichi. "Divergence function, information monotonicity and information geometry." Workshop on Information Theoretic Methods in Science and Engineering (WITMSE). 2009.
The triangle inequality would be that $$D_B(p,q)\leq D_B(p,r)+D_B(r,q)$$ for all probability distributions $p,q,r$. So to show that the inequality does not hold, it is sufficient to find one counterexample.
One such counterexample is given by the following simple Bernoulli distributions: $$ p=(0.1,0.9), \quad q=(0.9,0.1), \quad r=(0.5,0.5). $$ Then $$D_B(p,q) = -\ln(2\sqrt{0.09}) \approx 0.51$$ but $$D_B(p,r)=D_B(r,q)=-\ln(\sqrt{0.05}+\sqrt{0.45})\approx 0.11. $$
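These numbers are easy to verify directly, for example:
DB <- function(pp,qq) -log(sum(sqrt(pp*qq)))   # Bhattacharyya distance
p <- c(0.1, 0.9); q <- c(0.9, 0.1); r <- c(0.5, 0.5)
DB(p, q)                                       # approx 0.51
DB(p, r) + DB(r, q)                            # approx 0.22, smaller, so the triangle inequality fails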
In general, I hack together a simple R script when searching for such counterexamples. (Or when I have a hunch and want to test it before thinking deeply about it. "Computers are cheap, and thinking hurts.") In the present case, a script like the following quickly points us in the right direction:
nn <- 2                                        # number of outcomes per distribution
normalize <- function(xx) xx/sum(xx)           # turn positive numbers into a probability vector
DB <- function(pp,qq) -log(sum(sqrt(pp*qq)))   # Bhattacharyya distance
while ( TRUE ) {
  # draw three random distributions and test the triangle inequality
  pp <- normalize(runif(nn))
  qq <- normalize(runif(nn))
  rr <- normalize(runif(nn))
  if ( DB(pp,rr) > DB(pp,qq)+DB(qq,rr) ) {
    cat(pp,"\n",qq,"\n",rr,"\n")               # print the counterexample and stop
    break
  }
}
Best Answer
The Bhattacharyya coefficient is $$ BC(h,g)= \int \sqrt{h(x) g(x)}\; dx $$ in the continuous case. There is a good wikipedia article https://en.wikipedia.org/wiki/Bhattacharyya_distance. How to understand this (and the related distance)? Let us start with the multivariate normal case, which is instructive and can be found at the link above. When the two multivariate normal distributions have the same covariance matrix, the Bhattacharyya distance reduces to (one eighth of) the squared Mahalanobis distance; when the covariance matrices differ, it has an additional term, and so it generalizes the Mahalanobis distance. This may underlie claims that in some cases the Bhattacharyya distance works better than the Mahalanobis distance. The Bhattacharyya distance is also closely related to the Hellinger distance https://en.wikipedia.org/wiki/Hellinger_distance.
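A minimal R sketch of that normal case, using the closed-form expression given at the wikipedia link above (the means and covariance matrices below are arbitrary example values):
bhattNormal <- function(mu1, S1, mu2, S2) {
  S <- (S1 + S2)/2                                        # averaged covariance
  d <- mu1 - mu2
  as.numeric(t(d) %*% solve(S) %*% d)/8 +                 # Mahalanobis-type term
    0.5*log(det(S)/sqrt(det(S1)*det(S2)))                 # covariance-mismatch term
}
mu1 <- c(0,0); mu2 <- c(1,2); S0 <- diag(2)
bhattNormal(mu1, S0, mu2, S0)      # equal covariances: one eighth of the squared Mahalanobis distance
bhattNormal(mu1, S0, mu2, 2*S0)    # unequal covariances: the second term kicks in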
Working with the formula above, we can find a stochastic interpretation. Write $$ \DeclareMathOperator{\E}{\mathbb{E}} BC(h,g) = \int \sqrt{h(x) g(x)}\; dx = \\ \int h(x) \cdot \sqrt{\frac{g(x)}{h(x)}}\; dx = \E_h \sqrt{\frac{g(X)}{h(X)}} $$ so it is the expected value of the square root of the likelihood ratio statistic, calculated under the distribution $h$ (the null distribution of $X$). That invites comparison with Intuition on the Kullback-Leibler (KL) Divergence, which interprets the Kullback-Leibler divergence as the expectation of the log-likelihood ratio statistic (but calculated under the alternative $g$). Such a viewpoint might be interesting in some applications.
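This expectation form also lends itself to a quick Monte Carlo check; a sketch with two arbitrary example densities, $h = N(0,1)$ and $g = N(1,1)$:
set.seed(1)
h <- function(x) dnorm(x, 0, 1)                            # the "null" density
g <- function(x) dnorm(x, 1, 1)                            # the alternative density
x <- rnorm(1e5, 0, 1)                                      # sample from h
mean(sqrt(g(x)/h(x)))                                      # Monte Carlo estimate of BC(h,g)
integrate(function(x) sqrt(h(x)*g(x)), -Inf, Inf)$value    # direct numerical integral, for comparison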
Still another viewpoint: compare with the general family of f-divergences (see also Rényi entropy), defined as $$ D_f(h,g) = \int h(x) f\left( \frac{g(x)}{h(x)}\right)\; dx. $$ If we choose $f(t)= 4\left( \frac{1+t}{2}-\sqrt{t} \right)$, the resulting f-divergence is the Hellinger divergence, from which we can calculate the Bhattacharyya coefficient. This can also be seen as an example of a Rényi divergence, obtained from a Rényi entropy; see the link above.
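A quick numerical check of that connection (same two example normal densities as above; with this choice of $f$, integrating term by term gives $D_f(h,g) = 4(1-BC(h,g))$, so the coefficient is recovered as $BC = 1 - D_f/4$):
f <- function(t) 4*((1 + t)/2 - sqrt(t))
h <- function(x) dnorm(x, 0, 1)
g <- function(x) dnorm(x, 1, 1)
Df <- integrate(function(x) h(x)*f(g(x)/h(x)), -Inf, Inf)$value
BC <- integrate(function(x) sqrt(h(x)*g(x)), -Inf, Inf)$value
c(Df, 4*(1 - BC))                                          # the two values agree numerically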