If you define $I(X; X)$ for continuous random variables at all, the proper value for it is infinite, not $I(X; X) = H(X)$. Essentially, the value of $X$ gives you an infinite amount of information about $X$. If $X$ is, for instance, a uniformly random real number, it almost surely takes an infinite number of bits to describe it exactly (there is no pattern, unlike in e.g. the digits of $\pi$).
On the other hand, for distinct variables $X$ and $Y$ (no matter how similar), the value of $X$ gives you only a finite amount of information about $Y$. If you zoom in far enough on some point of $p(x, y)$, the density looks flat, so $X$ and $Y$ are practically independent inside that region. Describing where that region is takes a finite number of bits, while specifying the exact point within the region takes an infinite number of bits. The information shared between $X$ and $Y$ lives in that finite number of bits, so the mutual information is finite. If, however, $X = Y$, then no matter how far you zoom in, knowing $X$ always tells you exactly where $Y$ is, giving you an infinite amount of information. That is why $I(X; X)$ is very different from $I(X; Y)$.
If that's not convincing, you can just try some calculations. Example: the mutual information of a bivariate Gaussian $(x, y)$ with $\operatorname{Var}(x) = \operatorname{Var}(y) = 1$ and $\operatorname{Cov}(x, y) = r$ is $I(x; y) = -\tfrac{1}{2}\log(1-r^2)$, which goes to infinity as $r$ goes to $1$.
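A quick numerical check of that formula (a minimal sketch; the function name `gaussian_mi` is my own):

```python
import math

def gaussian_mi(r):
    """Mutual information (in nats) of a bivariate Gaussian with
    unit variances and correlation r."""
    return -0.5 * math.log(1 - r ** 2)

# MI grows without bound as r approaches 1
for r in (0.5, 0.9, 0.99, 0.999999):
    print(f"r = {r}: I(x; y) = {gaussian_mi(r):.4f} nats")
```

For $r = 0$ the formula gives $I = 0$ (independence), and each step toward $r = 1$ adds more and more shared information.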
Yes, because
$$\text{Corr}(X,Y)\ne0 \Rightarrow \text{Cov}(X,Y)\ne0$$
$$\Rightarrow E(XY) - E(X)E(Y) \ne 0 $$
$$\Rightarrow \int \int xyf_{X,Y}(x,y)dxdy -\int xf_X(x) dx\int yf_Y(y)dy \ne 0$$
$$\Rightarrow \int \int xyf_{X,Y}(x,y)dxdy -\int \int xyf_X(x) f_Y(y)dxdy \ne 0$$
$$\Rightarrow \int \int xy \big[f_{X,Y}(x,y) -f_X(x) f_Y(y)\big]dxdy \ne 0$$
which would be impossible if $f_{X,Y}(x,y) -f_X(x) f_Y(y) =0,\;\; \forall \{x,y\}$. So
$$\text{Corr}(X,Y)\ne0 \Rightarrow \exists \{x,y\}:f_{X,Y}(x,y) \ne f_X(x) f_Y(y)$$
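The same implication can be verified concretely in the discrete case, where the densities become pmfs. Below is a small sketch with a hypothetical joint pmf on $\{0,1\}\times\{0,1\}$ chosen to have nonzero covariance; the code then finds points where the joint differs from the product of the marginals:

```python
# Hypothetical joint pmf on {0,1} x {0,1} with positive covariance
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal pmfs
px = {x: sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)}
py = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Cov(X, Y) = E[XY] - E[X]E[Y]
exy = sum(x * y * p[(x, y)] for x, y in p)
ex = sum(x * px[x] for x in px)
ey = sum(y * py[y] for y in py)
cov = exy - ex * ey
print(f"Cov(X, Y) = {cov:.3f}")  # nonzero

# Nonzero covariance forces the joint to differ from the product somewhere
witnesses = [(x, y) for x, y in p if abs(p[(x, y)] - px[x] * py[y]) > 1e-12]
print("points where p(x,y) != p(x)p(y):", witnesses)
```

Here $\operatorname{Cov}(X,Y) = 0.4 - 0.5 \cdot 0.5 = 0.15$, and indeed every cell of the joint differs from the $0.25$ predicted by independence.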
Question: what happens with random variables that have no densities?
Best Answer
Mutual information is zero if and only if $p(x,y) = p(x)\,p(y)$, i.e. if and only if $X$ and $Y$ are independent, and independence implies that the correlation is zero. So, if the correlation is non-zero, the mutual information must be non-zero.
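To illustrate, here is a minimal sketch that computes the mutual information $I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$ for a hypothetical discrete joint pmf with nonzero correlation, confirming that it comes out positive:

```python
import math

# Hypothetical joint pmf on {0,1} x {0,1}; Cov(X,Y) = 0.15 != 0
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(v for (a, b), v in p.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (a, b), v in p.items() if b == y) for y in (0, 1)}

# I(X;Y) = sum p(x,y) log[ p(x,y) / (p(x) p(y)) ]
mi = sum(v * math.log(v / (px[x] * py[y])) for (x, y), v in p.items())
print(f"I(X; Y) = {mi:.4f} nats")  # strictly positive
```

Since the joint differs from the product of the marginals, every term contributes and the mutual information is strictly positive, as the argument above requires.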