Mutual Information vs Correlation – Understanding the Differences

correlation, mathematical-statistics, mutual-information

Why and when should we use Mutual Information instead of statistical correlation measures such as Pearson, Spearman, or Kendall's tau?

Best Answer

Let's consider one fundamental concept of (linear) correlation, covariance (which is Pearson's correlation coefficient "un-standardized"). For two discrete random variables $X$ and $Y$ with probability mass functions $p(x)$, $p(y)$ and joint pmf $p(x,y)$ we have

$$\operatorname{Cov}(X,Y) = E(XY) - E(X)E(Y) = \sum_{x,y}p(x,y)xy - \left(\sum_xp(x)x\right)\cdot \left(\sum_yp(y)y\right)$$
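
Since the product of the two marginal sums is itself a double sum, $\left(\sum_xp(x)x\right)\left(\sum_yp(y)y\right)=\sum_{x,y}p(x)p(y)\,xy$, the two terms combine under a single sum: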

$$\Rightarrow \operatorname{Cov}(X,Y) = \sum_{x,y}\left[p(x,y)-p(x)p(y)\right]xy$$

The Mutual Information between the two is defined as

$$I(X,Y) = E\left (\ln \frac{p(x,y)}{p(x)p(y)}\right)=\sum_{x,y}p(x,y)\left[\ln p(x,y)-\ln p(x)p(y)\right]$$
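
To make the comparison concrete, here is a minimal sketch in Python/NumPy (the joint pmf is an invented toy example, not taken from the question) that evaluates both expressions on the same discrete joint pmf:

```python
import numpy as np

# Toy joint pmf of two discrete rv's (hypothetical numbers).
# Rows index the values of X, columns the values of Y.
x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([-1.0, 1.0])
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.05, 0.25]])   # entries sum to 1

p_x = p_xy.sum(axis=1)            # marginal pmf of X
p_y = p_xy.sum(axis=0)            # marginal pmf of Y
p_indep = np.outer(p_x, p_y)      # p(x)p(y), the independence benchmark

# Cov(X,Y) = sum_{x,y} [p(x,y) - p(x)p(y)] * x * y
cov = np.sum((p_xy - p_indep) * np.outer(x_vals, y_vals))

# I(X,Y) = sum_{x,y} p(x,y) * [ln p(x,y) - ln p(x)p(y)]  (in nats;
# cells with p(x,y) = 0 would contribute 0 and should be skipped)
mi = np.sum(p_xy * (np.log(p_xy) - np.log(p_indep)))

print(f"Cov(X,Y) = {cov:.4f}, I(X,Y) = {mi:.4f} nats")
```

Both numbers are built from the same cell-wise comparison of `p_xy` with `p_indep`; only what that comparison gets multiplied by differs.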

Compare the two: each contains a point-wise "measure of the distance of the two rv's from independence", expressed through the distance of the joint pmf from the product of the marginal pmfs: $\operatorname{Cov}(X,Y)$ has it as a difference of levels, $p(x,y)-p(x)p(y)$, while $I(X,Y)$ has it as a difference of logarithms, $\ln p(x,y)-\ln p(x)p(y)$.

And what do these point-wise measures do? In $\operatorname{Cov}(X,Y)$ they form a weighted sum of the products $xy$ of the two random variables' values. In $I(X,Y)$ they form a weighted sum of the joint probabilities $p(x,y)$.

So with $\operatorname{Cov}(X,Y)$ we look at what non-independence does to their product, while in $I(X,Y)$ we look at what non-independence does to their joint probability distribution.

Seen the other way around, $I(X,Y)$ is the average value of the logarithmic measure of distance from independence, while $\operatorname{Cov}(X,Y)$ is the sum of the levels-measure of distance from independence weighted by the products $xy$ of the values of the two rv's.

So the two are not antagonistic but complementary, describing different aspects of the association between two random variables. One could comment that Mutual Information "is not concerned" with whether the association is linear or not, while Covariance may be zero even though the variables are still stochastically dependent. On the other hand, Covariance can be calculated directly from a data sample, without needing to know the probability distributions involved (since it is an expression involving moments of the distribution), while Mutual Information requires knowledge of the distributions, whose estimation, if they are unknown, is a much more delicate and uncertain task than the estimation of Covariance.
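
To illustrate both caveats, here is a hedged sketch (NumPy again; the histogram-based MI estimate below is just one simple plug-in estimator, chosen for illustration, and its value depends on the binning): with $X$ symmetric and $Y=X^2$, the sample covariance is near zero even though the variables are perfectly dependent, while a binned estimate of the mutual information is clearly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)      # symmetric around 0
y = x**2                          # deterministic, hence fully dependent on x

# Covariance straight from the sample: no distributions needed.
print("sample Cov(X,Y):", np.cov(x, y)[0, 1])            # close to 0

# A crude plug-in MI estimate: bin the data, treat the 2-D histogram as a
# joint pmf, and apply the definition. The result depends on the bin choice.
counts, _, _ = np.histogram2d(x, y, bins=30)
p_xy = counts / counts.sum()
p_indep = np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))
mask = p_xy > 0                                           # skip empty cells
mi = np.sum(p_xy[mask] * np.log(p_xy[mask] / p_indep[mask]))
print("binned MI estimate (nats):", mi)                   # clearly positive
```

With a different sample size or bin count the MI estimate shifts, which is exactly the estimation fragility mentioned above, whereas the sample covariance is stable and needs no binning at all.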