[Math] Minimizing KL divergence: the asymmetry, when will the solution be the same

calculus-of-variations, it.information-theory, pr.probability, probability-distributions, st.statistics

The KL divergence between two distributions $p$ and $q$ is defined as
$$
D(q\|p) = \int q(x)\log \frac{q(x)}{p(x)}\, dx
$$
and is known to be asymmetric: $D(q\|p)\neq D(p\|q)$ in general.
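
As a quick numerical illustration of the asymmetry (a minimal sketch of my own, using two arbitrary discrete distributions; `scipy.special.rel_entr(a, b)` computes the elementwise terms $a\log\frac{a}{b}$):

```python
import numpy as np
from scipy.special import rel_entr  # rel_entr(a, b) = a * log(a / b), elementwise

# two arbitrary discrete distributions on the same three-point support
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

D_qp = rel_entr(q, p).sum()  # D(q || p)
D_pq = rel_entr(p, q).sum()  # D(p || q)

print(D_qp, D_pq)  # approx. 0.192 vs 0.184 -- the two values differ
```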

If we fix $p$ and look for a distribution $q$ in a class $E$ that minimizes the KL divergence, it is also known that minimizing $D(q\|p)$ generally yields a different minimizer than minimizing $D(p\|q)$; see, e.g., https://benmoran.wordpress.com/2012/07/14/kullback-leibler-divergence-asymmetry/.

It is not clear which of the two should be optimized to obtain a better approximation, although in many applications we minimize $D(q\|p)$.

My question is: when do the two problems have the same solution,

$$
\underset{q\in E}{\operatorname{argmin}} D(q\|p) = \underset{q\in E}{\operatorname{argmin}} D(p\|q)?
$$

For instance, if we take $E$ to be the class of all Gaussian distributions, is there a condition on $p$ under which the two minimizations lead to the same minimizer?
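
For this Gaussian example, here is a small numerical sketch (my own illustration, not part of the question): $p$ is a two-component Gaussian mixture discretized on a grid, $E$ is the set of single Gaussians parametrized by $(\mu,\sigma)$, and both divergences are minimized by brute-force grid search over $(\mu,\sigma)$.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import rel_entr

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def normalize(f):
    return f / (f.sum() * dx)

def kl(a, b):
    # KL divergence between densities a, b on the grid (discretized integral)
    return rel_entr(a, b).sum() * dx

# target p: bimodal Gaussian mixture
p = normalize(0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1))

# brute-force search over single Gaussians q = N(mu, sigma^2)
mus = np.linspace(-5, 5, 101)
sigmas = np.linspace(0.5, 5, 46)

best_qp, best_pq = None, None
for mu in mus:
    for s in sigmas:
        q = normalize(norm.pdf(x, mu, s))
        d_qp, d_pq = kl(q, p), kl(p, q)
        if best_qp is None or d_qp < best_qp[0]:
            best_qp = (d_qp, mu, s)
        if best_pq is None or d_pq < best_pq[0]:
            best_pq = (d_pq, mu, s)

print("argmin D(q||p):", best_qp[1:])  # locks onto one mode: mu ~ +-3, sigma ~ 1
print("argmin D(p||q):", best_pq[1:])  # covers both modes: mu ~ 0, sigma ~ sqrt(10) ~ 3.2
```

For this bimodal $p$ the two argmins differ markedly (mode-seeking vs. mass-covering). Swapping in a Gaussian $p$ as a sanity check makes both searches return essentially the same $(\mu,\sigma)$, up to grid resolution, consistent with the trivial case $p\in E$ where both divergences vanish exactly at $q=p$.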

Best Answer

I don't have a definite answer, but here is something to continue with:

Formulate the constrained optimization problems as
$$
\mathrm{argmin}_{F(q)=0} D(q\|p), \qquad \mathrm{argmin}_{F(q)=0} D(p\|q),
$$
and form the respective Lagrange functionals. Using that the derivatives of $D$ with respect to the first and the second argument are, respectively,
$$
\nabla_1 D(q\|p) = \log\big(\tfrac{q}{p}\big)+1 \quad\text{and}\quad \nabla_2 D(p\|q) = -\tfrac{p}{q},
$$
you see that necessary conditions for the respective optima $q^*$ and $q^{**}$ are
$$
\log\big(\tfrac{q^*}{p}\big)+1 + \nabla F(q^*)\lambda = 0 \quad\text{and}\quad -\tfrac{p}{q^{**}} + \nabla F(q^{**})\lambda = 0.
$$
I would not expect $q^*$ and $q^{**}$ to be equal for any non-trivial constraint…
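
As a sanity check of the two gradient formulas (my addition, not part of the original answer), here is a quick finite-difference sketch that treats the densities as probability vectors on a 5-point support and checks the unconstrained coordinate-wise derivatives:

```python
import numpy as np
from scipy.special import rel_entr

rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()   # fixed target distribution
q = rng.random(5); q /= q.sum()   # point at which the gradients are checked

def D_qp(r):  # D(r || p) as a function of its first argument
    return rel_entr(r, p).sum()

def D_pq(r):  # D(p || r) as a function of its second argument
    return rel_entr(p, r).sum()

def num_grad(f, r, h=1e-6):
    # coordinate-wise central finite differences
    # (no simplex constraint imposed, matching the unconstrained derivatives above)
    g = np.zeros_like(r)
    for i in range(len(r)):
        e = np.zeros_like(r); e[i] = h
        g[i] = (f(r + e) - f(r - e)) / (2 * h)
    return g

print(np.allclose(num_grad(D_qp, q), np.log(q / p) + 1, atol=1e-5))  # True
print(np.allclose(num_grad(D_pq, q), -p / q, atol=1e-5))             # True
```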

On the positive side, $\nabla_1 D(q\|p)$ and $\nabla_2 D(p\|q)$ agree to first order at $q=p$, up to an additive constant:
$$
\nabla_1 D(q\|p) = \nabla_2 D(p\|q) + 2 + \mathcal{O}\big((\tfrac{q}{p}-1)^2\big).
$$
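
To spell out this expansion (my addition, not part of the original answer): writing $t = \tfrac{q}{p}$ and Taylor expanding around $t=1$,
$$
\nabla_1 D(q\|p) = \log t + 1 = t + \mathcal{O}\big((t-1)^2\big),
\qquad
\nabla_2 D(p\|q) = -\tfrac{1}{t} = t - 2 + \mathcal{O}\big((t-1)^2\big),
$$
so the two gradients differ only by the constant $2$ to first order. Assuming the normalization constraint $\int q\,dx = 1$ is among the constraints $F(q)=0$, an additive constant is absorbed into the corresponding multiplier, so the two stationarity conditions agree to first order when $q \approx p$.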