I am trying to understand the explanation of the KL divergence below. As I understand it, the phrase "approximating the expectation over q in this term" refers to an expectation in the second term. However, we are multiplying q(x) by the log of p(x) (rather than by p(x) itself). Is it still correct to refer to this construct as an expected value? Please let me know.

# Solved – KL divergence and expectations


#### Related Solutions

The KL divergence is an *asymmetric* measure. It can be symmetrized for two distributions $F$ and $G$ by averaging or summing it, as in

$$KLD(F,G) = KL(F,G) + KL(G,F).$$

Because the formula quoted in the question clearly is symmetric, we might hypothesize that it is such a symmetrized version. Let's check:

$$ \left(\log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}\right) + \left(\log \frac{\sigma_1}{\sigma_2} + \frac{\sigma_2^2 + (\mu_2 - \mu_1)^2}{2 \sigma_1^2} - \frac{1}{2}\right)$$

$$=\left(\log(\sigma_2/\sigma_1) + \log(\sigma_1/\sigma_2)\right) + \frac{\sigma_1^2}{2\sigma_2^2} + \frac{\sigma_2^2}{2\sigma_1^2} + \left(\mu_1-\mu_2\right)^2\left(\frac{1}{2\sigma_2^2}+\frac{1}{2\sigma_1^2}\right) - 1$$

The logarithms obviously cancel, which is encouraging. The factor $\left(\frac{1}{2\sigma_2^2}+\frac{1}{2\sigma_1^2}\right)$ multiplying the term with the means motivates us to introduce the same sum of fractions in the variance terms of the expression as well. We compensate for the two new terms (each equal to $1/2$) by subtracting them again, and then collect all multiples of this factor:

$$= 0 + \frac{\sigma_1^2}{2\sigma_2^2}+\left(\frac{\sigma_1^2}{2\sigma_1^2}-\frac{1}{2}\right) + \frac{\sigma_2^2}{2\sigma_1^2} + \left(\frac{\sigma_2^2}{2\sigma_2^2} - \frac{1}{2} \right)+ \cdots$$

$$= \left(\sigma_1^2 + \sigma_2^2\right)\left(\frac{1}{2\sigma_1^2} + \frac{1}{2\sigma_2^2}\right) - 1 + \left(\mu_1-\mu_2\right)^2\left(\frac{1}{2\sigma_2^2}+\frac{1}{2\sigma_1^2}\right) - 1$$

$$ = \frac{1}{2}\left(\left(\mu_1-\mu_2\right)^2 + \left(\sigma_1^2 + \sigma_2^2\right)\right)\left(\frac{1}{\sigma_2^2}+\frac{1}{\sigma_1^2}\right) - 2.$$

That's precisely the value found in the reference: it is the *sum* of the two KL divergences, also known as the symmetrized divergence.
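As a quick numerical sanity check of this derivation, here is a short sketch (the function names `kl_normal` and `symmetrized_kl` are ours, not from the reference) comparing the sum of the two directed divergences against the closed form just derived:

```python
import math

def kl_normal(mu1, s1, mu2, s2):
    """KL(N(mu1, s1^2) || N(mu2, s2^2)), the closed form quoted above."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def symmetrized_kl(mu1, s1, mu2, s2):
    """The symmetrized closed form derived above."""
    return 0.5 * ((mu1 - mu2)**2 + s1**2 + s2**2) * (1 / s1**2 + 1 / s2**2) - 2

# arbitrary test parameters
mu1, s1, mu2, s2 = 0.3, 1.2, -0.5, 0.8

# sum of the two directed divergences...
lhs = kl_normal(mu1, s1, mu2, s2) + kl_normal(mu2, s2, mu1, s1)
# ...equals the symmetrized closed form
rhs = symmetrized_kl(mu1, s1, mu2, s2)
assert abs(lhs - rhs) < 1e-12
```

The two expressions agree to floating-point precision for any choice of means and (positive) standard deviations.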

There is a purely statistical approach to the Kullback-Leibler divergence: take an iid sample $X_1,\ldots,X_n$ from an unknown distribution $p^\star$ and consider the potential fit by a family of distributions
$$\mathfrak{F}=\{p_\theta\,,\ \theta\in\Theta\}.$$
The corresponding likelihood is defined as
$$L(\theta|x_1,\ldots,x_n)=\prod_{i=1}^n p_\theta(x_i)$$
and its logarithm is
$$\ell(\theta|x_1,\ldots,x_n)=\sum_{i=1}^n \log p_\theta(x_i).$$
Therefore, by the law of large numbers,
$$\frac{1}{n} \ell(\theta|x_1,\ldots,x_n) \longrightarrow \mathbb{E}[\log p_\theta(X)]=\int \log p_\theta(x)\,p^\star(x)\,\text{d}x,$$
which is the interesting part of the Kullback-Leibler divergence between $p_\theta$ and $p^\star$,
$$\mathfrak{H}(p_\theta|p^\star)\stackrel{\text{def}}{=}\int \log \{p^\star(x)/p_\theta(x)\}\,p^\star(x)\,\text{d}x;$$
the other part,
$$\int \log \{p^\star(x)\}\,p^\star(x)\,\text{d}x,$$
is there so that the minimum [in $\theta$] of $\mathfrak{H}(p_\theta|p^\star)$ equals zero.
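The convergence of the average log-likelihood to $\mathbb{E}[\log p_\theta(X)]$ can be illustrated with a small simulation. This is a sketch under our own choice of example, not from the text: we take $p^\star = N(0,1)$ and the model $p_\theta = N(\mu, 1)$, for which $\mathbb{E}[\log p_\mu(X)] = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(1+\mu^2)$ since $\mathbb{E}[(X-\mu)^2] = 1+\mu^2$:

```python
import math
import random

random.seed(0)
n = 200_000
# sample from the "unknown" true distribution p* = N(0, 1)
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

def log_p_theta(x, mu):
    # log density of the model N(mu, 1)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

mu = 0.7
# (1/n) * log-likelihood of the sample
avg_loglik = sum(log_p_theta(x, mu) for x in xs) / n
# closed form for E[log p_mu(X)] when X ~ N(0, 1)
expected = -0.5 * math.log(2 * math.pi) - 0.5 * (1 + mu**2)
assert abs(avg_loglik - expected) < 0.02
```

Maximizing the average log-likelihood in $\mu$ is therefore, in the large-sample limit, the same as minimizing $\mathfrak{H}(p_\mu|p^\star)$.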

A book that connects divergence, information theory, and statistical inference is Rissanen's *Optimal Estimation of Parameters*, which I reviewed here.

## Best Answer

The expected value is a quantity that can be computed for *any function* of the outcomes. Let $\Omega$ be the space of all possible outcomes and let $q:\Omega \rightarrow \mathbb{R}$ be a probability distribution defined on $\Omega$. For *any function* $f:\Omega \rightarrow S$, where $S$ is an arbitrary set that is closed under addition and scalar multiplication (e.g. $S = \mathbb{R}$), we can compute the expected value of $f$ under the distribution $q$ as follows: $$ \mathbb{E}[f] = \mathbb{E}_{x \sim q}[f(x)] = \sum_{x \in \Omega} q(x) f(x) $$ In the KL divergence, we have $f(x) = \ln{\frac{q(x)}{p(x)}}$ for some fixed $p(x)$.
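This view can be made concrete on a toy discrete space. The outcome labels and probabilities below are our own illustrative choices: the KL divergence is literally the expectation, under $q$, of the function $f(x) = \ln\frac{q(x)}{p(x)}$:

```python
import math

# a small outcome space with two distributions on it
omega = ["a", "b", "c"]
q = {"a": 0.5, "b": 0.3, "c": 0.2}
p = {"a": 0.4, "b": 0.4, "c": 0.2}

def expectation(f, dist):
    """E_{x~dist}[f(x)] = sum over outcomes of dist(x) * f(x)."""
    return sum(dist[x] * f(x) for x in omega)

# KL(q || p) is the expected value of f(x) = ln(q(x)/p(x)) under q
f = lambda x: math.log(q[x] / p[x])
kl = expectation(f, q)
assert kl >= 0  # Gibbs' inequality: KL divergence is non-negative
```

Nothing special happens because $f$ itself mentions $q$ and $p$: it is just another function of the outcome $x$, so weighting it by $q(x)$ and summing is a perfectly ordinary expected value.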