Mutual Information – Why Is Mutual Information Symmetric? An Information Theory Perspective

information theory, kullback-leibler, mutual information

I know that mutual information is the Kullback-Leibler divergence between $p(x,y)$ and $p(x)p(y)$. But mutual information is also described as the amount of entropy lost (or, in another sense, the information gained) about $X$ by virtue of knowing about $Y$.

Without just looking at the formula, it's not clear why the additional information about $X$ due to having $Y$ would be the same as the additional information about $Y$ due to having $X$. It's not immediately obvious that predicting $Y$ from the behavior of $X$ is just as easy as predicting $X$ from the behavior of $Y$. While "mutual" is in the name, mutual information is described in terms of learning about $X$ using $Y$, so in the same way that, e.g., KL divergence (which is described in terms of describing $X$ using $Y$) is asymmetric, my intuition would have been that "mutual" information is asymmetric too.

Is there any intuition here without just looking at the formula for $I(X;Y)$?

Best Answer

I know that mutual information is the Kullback-Leibler divergence between $p(x,y)$ and $p(x)p(y)$.

Yes and no. The Kullback–Leibler divergence between two distributions $p$ and $q$ is defined as

$$ \sum_x p(x) \, \log\Big( \frac{p(x)}{q(x)} \Big) $$

It is asymmetric because $\tfrac{p(x)}{q(x)} \ne \tfrac{q(x)}{p(x)}$ in general, and because the terms are weighted by $p(x)$, which gives a different result than weighting by $q(x)$.
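
If it helps to see the asymmetry numerically, here is a minimal sketch (the distributions `p` and `q` below are made up purely for illustration) that evaluates the sum above in both directions:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence: sum_x p(x) * log(p(x) / q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])  # made-up distribution p(x)
q = np.array([0.1, 0.3, 0.6])  # made-up distribution q(x)

print(kl_divergence(p, q))  # ~1.10 nats
print(kl_divergence(q, p))  # ~1.00 nats -- a different value, so KL is asymmetric
```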

Mutual information between two random variables $X$ and $Y$ is

$$ \sum_{x,y} p_{X,Y}(x, y) \,\log\Big( \frac{p_{X,Y}(x, y)}{p_X(x) \,p_Y(y)} \Big) $$

Notice that $p_{X,Y}(x, y) = p_{Y,X}(y, x)$, just as $p_X(x) \, p_Y(y) = p_Y(y) \, p_X(x)$, so exchanging $X$ with $Y$ does not change the result: mutual information is symmetric.

It would be asymmetric if you swapped the roles of $p_{X,Y}(x, y)$ and $p_X(x) \, p_Y(y)$, but then it would no longer be mutual information; it would be the KL divergence taken in the other direction, with $p_X(x) \, p_Y(y)$ playing the role of $p$ and $p_{X,Y}(x, y)$ playing the role of $q$ in the formula above. That also wouldn't make much sense, because $p_{X,Y}(x, y)$ is the actual joint distribution, while the product $p_X(x) \, p_Y(y)$ is the independence baseline (what the joint would look like if $X$ and $Y$ carried no information about each other) that we compare it to.
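
Here is a small numerical sketch of both points, using an arbitrary $2 \times 2$ joint table (the numbers are made up): transposing the table, i.e. exchanging $X$ and $Y$, leaves the mutual information unchanged, while exchanging the two arguments of the KL divergence gives a different value.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence over flattened probability tables, in nats."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    return float(np.sum(p * np.log(p / q)))

def mutual_information(joint):
    """I(X;Y) = KL( p(x,y) || p(x) p(y) ) for a discrete joint table (rows = x, columns = y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y)
    return kl_divergence(joint, px * py)

# Arbitrary joint distribution p(x, y) over a 2 x 2 alphabet
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

print(mutual_information(p_xy))    # I(X;Y) ~ 0.086 nats
print(mutual_information(p_xy.T))  # I(Y;X) -- exactly the same value

# Swapping the two arguments of the KL divergence is no longer mutual information:
px = p_xy.sum(axis=1, keepdims=True)
py = p_xy.sum(axis=0, keepdims=True)
print(kl_divergence(p_xy, px * py))  # ~0.086 nats (this is I(X;Y))
print(kl_divergence(px * py, p_xy))  # ~0.093 nats -- a different quantity
```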

Without just looking at the formula, it's not clear why the additional info about $X$ due to having $Y$ would be the same as the additional information about $Y$ due to having $X$.

This sounds more like the role of conditional entropy: the "information gained about $X$ by knowing $Y$" is the drop from $H(X)$ to $H(X \mid Y)$. Mutual information measures the "mutual dependence between the two variables" and is symmetric, as the identities below make explicit.
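
To connect that description back to the formula, it may help to recall the standard textbook identities (stated here for completeness, not derived in this answer):

$$ I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y) $$

The right-most form treats $X$ and $Y$ completely interchangeably, so the reduction in uncertainty about $X$ from observing $Y$ must equal the reduction in uncertainty about $Y$ from observing $X$.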