Solved – Is mutual information really invariant to invertible transformations?

entropy, information theory, mutual information

"Estimating mutual information" [A Kraskov, H Stögbauer, P Grassberger – Physical Review E, 2004]
states that

Mutual information is invariant under reparametrization of the marginal variables. If $X' = F(X)$ and $Y' = G(Y)$ are homeomorphisms [i.e. smooth uniquely invertible maps], then $$I(X, Y) = I(X', Y')$$

This paper is also cited on Wikipedia in support of the same claim.

But if that is true, doesn't taking $F = G$ and $X = Y$ imply
$$H(X) = I(X, X) = I(X', X') = I(F(X), F(X)) = H(F(X)),$$
where $H$ is the differential entropy?

Wouldn't this be "proof" that differential entropy is also invariant under such transformations (which it obviously isn't, because for a constant $a$ with $|a| \neq 1$ we have $H(aX) = H(X) + \log|a| \neq H(X)$)?
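
For completeness, that scaling identity follows from the change-of-variables formula for densities, $p_{aX}(y) = \tfrac{1}{|a|}\, p_X(y/a)$: substituting $x = y/a$,
$$H(aX) = -\int \tfrac{1}{|a|}\, p_X\!\left(\tfrac{y}{a}\right) \log\!\left[\tfrac{1}{|a|}\, p_X\!\left(\tfrac{y}{a}\right)\right] dy = -\int p_X(x) \log p_X(x)\, dx + \log|a| = H(X) + \log|a|.$$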

In particular, I'm wondering whether $I(X, F(X)) = H(X) = H(F(X))$ holds for all homeomorphisms $F$.

Can someone help me reduce my uncertainty?

PS: I'm talking about the continuous case, i.e. differential entropy and differential mutual information.

Best Answer

If you define $I(X; X)$ for continuous random variables at all, the proper value for it is infinite, not $I(X; X) = H(X)$. Essentially, the value of $X$ gives you an infinite amount of information about $X$: if $X$ is, say, a uniformly distributed real number, it almost surely takes an infinite number of bits to describe it exactly (there is no pattern that lets you compress it, unlike e.g. the digits of $\pi$).
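
One way to see this with differential entropies: given $X$, the conditional distribution of $X$ is a point mass, and the differential entropy of a point mass is $-\infty$, so
$$I(X; X) = H(X) - H(X \mid X) = H(X) - (-\infty) = +\infty$$
whenever $H(X)$ is finite, rather than $H(X)$ as the discrete identity $I(X; X) = H(X)$ would suggest.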

On the other hand, for distinct variables $X$ and $Y$ (no matter how similar), the value of $X$ gives you only a finite amount of information about $Y$, as long as they have a joint density. If you zoom in far enough on some point of $p(x, y)$, it looks flat, so $X$ and $Y$ are practically independent inside that small region. Describing where that region lies takes a finite number of bits, while pinning down the exact point within it takes an infinite number of bits. The information shared between $X$ and $Y$ lives in that finite number of bits, so the mutual information is finite. If, however, $X = Y$, then no matter how far you zoom in, knowing $X$ still tells you exactly where $Y$ is, which amounts to an infinite amount of information. That's why $I(X; X)$ is very different from $I(X; Y)$.

If that's not convincing, you can just try some calculations. Example: the mutual information of $(x, y)$ for a bivariate Gaussian with $\operatorname{Var}(x) = \operatorname{Var}(y) = 1$ and $\operatorname{Cov}(x, y) = r$ is $I(x; y) = -\tfrac{1}{2}\log(1 - r^2)$, which goes to infinity as $r \to 1$.
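
Here is a minimal numerical sketch of that divergence, assuming NumPy and scikit-learn are available; `mutual_info_regression` is a nearest-neighbour estimator in the spirit of the Kraskov et al. paper cited in the question:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression  # k-NN based MI estimator

rng = np.random.default_rng(0)
n = 20_000

for r in [0.9, 0.99, 0.999, 0.9999]:
    # Closed form for a bivariate Gaussian with unit variances and correlation r (in nats)
    closed_form = -0.5 * np.log(1.0 - r**2)

    # Draw samples and estimate I(x; y) from the data
    cov = [[1.0, r], [r, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    estimate = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3, random_state=0)[0]

    print(f"r = {r}: closed form = {closed_form:.3f} nats, sample estimate = {estimate:.3f} nats")
```

The closed form grows without bound as $r \to 1$, which is exactly the $I(X; X) = \infty$ limit; a finite-sample estimate will typically lag behind it for $r$ very close to $1$.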
