[Math] Does it make sense to calculate the KL-divergence between a joint distribution and a marginal distribution?

entropy, information theory, probability distributions

The KL-divergence is defined as:

$D_{KL} (p(x) \parallel q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)} $
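For concreteness, here is a minimal NumPy sketch of this definition; the distributions `p` and `q` below are made-up examples, not anything from the question:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x)/q(x)), in nats.

    Terms with p(x) = 0 contribute 0, by the usual convention.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

p = [0.5, 0.3, 0.2]  # made-up pmf over three symbols
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # ~0.025; nonnegative for any valid p and q
```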

If $A$ and $B$ are discrete variables, does it make sense to calculate $D_{KL}(p(A, B) \parallel p(A))$? Namely, the divergence between the joint distribution $p(A, B)$ and the marginal distribution $p(A)$. Or must the two distributions be of the same type (i.e. defined over exactly the same variables)?

I tried, and I found this:

$
\begin{align}
D_{KL}(p(A, B) \parallel p(A)) &= \sum_{a \in A} \sum_{b \in B} p(a, b) \log \frac {p(a, b)}{p(a)} \\
&= \sum_{a \in A} \sum_{b \in B} p(a, b) \log p(b \mid a) \\
&= -H\,(B \mid A)
\end{align}
$

Here, $H\,(B \mid A)$ is the conditional Shannon entropy.
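As a sanity check, a short NumPy computation on a made-up $2 \times 2$ joint distribution (the numbers are arbitrary) confirms the sign: the double sum equals $-H(B \mid A)$ and comes out negative:

```python
import numpy as np

p_ab = np.array([[0.1, 0.4],   # made-up joint p(a, b); rows index a, columns index b
                 [0.2, 0.3]])
p_a = p_ab.sum(axis=1)         # marginal p(a) = [0.5, 0.5]

# The double sum from the derivation above:
lhs = np.sum(p_ab * np.log(p_ab / p_a[:, None]))

# Conditional entropy H(B|A) = -sum_{a,b} p(a, b) log p(b|a):
p_b_given_a = p_ab / p_a[:, None]
h_b_given_a = -np.sum(p_ab * np.log(p_b_given_a))

print(lhs)           # ~ -0.587: negative, so it cannot itself be a KL divergence
print(-h_b_given_a)  # same value, confirming the sum equals -H(B|A)
```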

Now, if needed, we could make the second distribution the same type by defining $D_{KL}(p(A, B) \parallel q(A, B))$, where we simply set $q(A, B) = p(A)$. Would it then be okay to calculate $D_{KL}(p(A, B) \parallel q(A, B))$?

In the book "Elements of Information Theory", by Cover and Thomas, it says that $D_{KL}(p(x) \parallel q(x)) = \infty $ if the distribution $q$ doesn't define a probability value for every symbol that $p$ defines.

Best Answer

It seems to me that you have already answered your own question. Namely, $D_{KL}(p(A, B) \parallel p(A)) = -H\,(B \mid A)$.


Update: the above is not right. The definition of $D_{KL}$ requires two valid probability functions defined on the same space. It's true that we could regard $p(A)$ as a function of two variables that is constant in the second variable, as you wrote: $q(A, B) = p(A)$. But then the sum over the two variables would not (in general) equal one: summing $q(a, b) = p(a)$ over all pairs gives $\sum_b \sum_a p(a) = |B|$, so $p(A)$ would not be a valid joint probability function. Consistently, a genuine KL divergence is always non-negative, whereas the quantity above equals $-H\,(B \mid A) \le 0$.
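A minimal numeric sketch of this objection, reusing a made-up joint distribution: extending $q(a, b) := p(a)$ to two variables produces a table whose entries sum to $|B|$, not to one:

```python
import numpy as np

p_ab = np.array([[0.1, 0.4],   # made-up joint p(a, b)
                 [0.2, 0.3]])
p_a = p_ab.sum(axis=1)         # marginal p(a) = [0.5, 0.5]

# Extend q(a, b) := p(a), i.e. a function that is constant in b:
q_ab = np.tile(p_a[:, None], (1, p_ab.shape[1]))

print(q_ab.sum())  # 2.0 = |B|, not 1 -> q is not a valid joint pmf
```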

Hence, no, the conditional entropy cannot be written as a KL divergence.


> In the book "Elements of Information Theory", by Cover and Thomas, it says that $D_{KL}(p(x) \parallel q(x)) = \infty $ if the distribution $q$ doesn't define a probability value for every symbol that $p$ defines.

That's true, but it's inconsequential here. It means that $D_{KL}(p(x) \parallel q(x)) = \infty$ if $q(x) = 0$ for some value $x$ ("symbol") such that $p(x) > 0$. But that's not your case: for any given pair $(a, b)$ with $p(a, b) > 0$, you have $q(a, b) \triangleq p(a) = \sum_{b'} p(a, b') > 0$. So, no problem.
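The same made-up example, now with a zero cell in the joint, checks this support condition: wherever $p(a, b) > 0$, the candidate $q(a, b) = p(a)$ is also positive, so no term of the sum blows up:

```python
import numpy as np

p_ab = np.array([[0.0, 0.5],   # made-up joint with a zero cell
                 [0.2, 0.3]])
p_a = p_ab.sum(axis=1)                            # [0.5, 0.5]
q_ab = np.broadcast_to(p_a[:, None], p_ab.shape)  # q(a, b) = p(a)

support = p_ab > 0
# p(a) = sum_b' p(a, b') >= p(a, b) > 0 on the support of p:
print(bool(np.all(q_ab[support] > 0)))  # True -> every log ratio is finite
```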