Confused by Kullback-Leibler on conditional probability distributions

bayesian, information theory, machine learning, variational-analysis

I understand the Kullback-Leibler divergence well enough when it comes to a probability distribution over a single variable. However, I'm currently trying to teach myself variational methods and the use of the KL divergence in conditional probabilities is catching me out. The source I'm working from is here.

Specifically, the author represents the KL divergence as follows:

$$\operatorname{KL}(Q_\phi(Z \mid X) \,\|\, P(Z \mid X)) = \sum_{z \in Z} q_\phi(z \mid x) \log\frac{q_\phi(z \mid x)}{p(z \mid x)}$$

Where the confusion arises is on the summation across $Z$. Given that $z \in Z$ and $x \in X$, I would have expected (by analogy with conditional entropy) a double sum here of the form:

$$\operatorname{KL}(Q_\phi(Z \mid X) \,\|\, P(Z \mid X)) = \sum_{z \in Z} \sum_{x \in X} q_\phi(z \mid x) \log\frac{q_\phi(z \mid x)}{p(z \mid x)}$$

Otherwise, it seems to me that KL is only being calculated for one sample from $X$. Am I missing something basic here? And if my intuitions are off, any tips on getting them back on track would be welcome. I'm teaching myself this stuff, so I don't have the benefit of formal instruction.

Best Answer

It depends on whether you are conditioning on a random variable or an event.

Given a random variable $x$,

$$ \operatorname{KL}[p(y \mid x) \,\|\, q(y \mid x)] \doteq \iint p(\bar{x},\bar{y}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})} \mathrm{d}\bar{x} \mathrm{d}\bar{y} \quad\text{or}\quad \sum_{\bar{x}}\sum_{\bar{y}} p(\bar{x},\bar{y}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})}. $$

Given an event $\bar{x}$,

$$ \operatorname{KL}[p(y \mid \bar{x}) \,\|\, q(y \mid \bar{x})] \doteq \int p(\bar{y} \mid \bar{x}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})} \mathrm{d}\bar{y} \quad\text{or}\quad \sum_{\bar{y}} p(\bar{y} \mid \bar{x}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})}. $$
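To make the distinction concrete, here is a minimal numerical sketch in Python (NumPy); the arrays `p_xy` and `q_y_given_x` are made-up illustrative distributions, not anything from the source. The random-variable version weights the log-ratio by the joint $p(x, y)$, while the event version fixes $\bar{x}$ and sums over $y$ alone.

```python
import numpy as np

# Made-up discrete example: x takes 2 values, y takes 3 values.
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.10, 0.20]])      # joint p(x, y); rows index x, sums to 1
p_x = p_xy.sum(axis=1)                     # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]          # conditional p(y | x)

q_y_given_x = np.array([[0.3, 0.4, 0.3],
                        [0.5, 0.2, 0.3]])  # conditional q(y | x); each row sums to 1

# KL conditioned on the random variable x: double sum weighted by the joint.
kl_rv = np.sum(p_xy * np.log(p_y_given_x / q_y_given_x))

# KL conditioned on the event x = x_bar: single sum over y only.
x_bar = 0
kl_event = np.sum(p_y_given_x[x_bar] *
                  np.log(p_y_given_x[x_bar] / q_y_given_x[x_bar]))

print(kl_rv, kl_event)
```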

Note how conditioning on an event is equivalent to changing the probability distribution over its variable to a point mass. This is what turns the joint into a conditional above,

$$ p'(x,y) \doteq p(y \mid x)\,\delta_{\bar{x}}(x) = p(y \mid \bar{x}). $$

To be more explicit, instead of the KL conditioned on a random variable, you can use an expectation over events of the KL conditioned on each event,

$$ \operatorname{KL}[p(y \mid x) \,\|\, q(y \mid x)] =\operatorname{E}_{\bar{x}\sim p(x)}\big[ \operatorname{KL}[p(y \mid \bar{x}) \,\|\, q(y \mid \bar{x})] \big]. $$
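With the same made-up arrays as above, a quick sketch confirming this identity numerically: averaging the per-event KLs under $p(x)$ reproduces the joint-weighted sum.

```python
import numpy as np

p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.10, 0.20]])      # joint p(x, y)
p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]
q_y_given_x = np.array([[0.3, 0.4, 0.3],
                        [0.5, 0.2, 0.3]])

# Left-hand side: KL conditioned on the random variable x.
lhs = np.sum(p_xy * np.log(p_y_given_x / q_y_given_x))

# Right-hand side: expectation over events x_bar ~ p(x) of per-event KLs.
per_event_kl = np.sum(p_y_given_x * np.log(p_y_given_x / q_y_given_x), axis=1)
rhs = np.sum(p_x * per_event_kl)

assert np.isclose(lhs, rhs)  # the two definitions agree
print(lhs, rhs)
```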

Mixing up random variables and events is quite common, but it is usually easy to tell from context which is meant.
