Information Theory – Why Do Perfectly Similar Data Have Zero Mutual Information?

entropy, information theory, mutual information

I'm not a statistics major, so my knowledge of statistics is quite limited, but I've found myself needing to learn about and use mutual information. I believe I understand the concept and the formula, but it seems counter-intuitive that perfectly similar data would have 0 mutual information. I would expect two sets of data with perfect similarity to have a mutual information of 1.

I'm writing a program that calculates mutual information between two columns (i and j) of a pairwise protein sequence alignment, and I'm not sure whether I should manually fix the MI to 1 when the columns are exactly the same. For example, if the two columns (written as lines) were:

CLLYFFDTTQGILMIGCL
IILLIIIIFLIVILVIFV

Then I've calculated the MI by hand to be 0.8384. However, if the two columns (written as lines) were:

GGGGGGGGGG
GGGGGGGGGG

The mutual information will be 0, because the formula reduces to:

$$\sum_{x, y} p(x, y) \times \log \left( \frac{p(x, y)}{p(x)\, p(y)} \right)
= 1 \times \log \left( \frac{1}{1 \times 1} \right)
= 0$$

In my program I check whether the columns are identical (i = j) and just manually assign MI = 1. So I'm wondering: is this bad practice? Does it handle only one possible case of many? Should I not do this at all? The professor I'm working with suggested incorporating a pseudo-count for every non-occurring amino acid and skipping the manual fix for i = j.

Would a pseudo-count fix this, counting each observed amino acid as 1.1 and every other amino acid as 0.1, or would it result in the same mutual information? Again, I'm far from a statistics major, so if you can explain it simply, that would be best. Thanks.
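For reference, here is a minimal sketch of the plug-in MI calculation I have in mind (Python; the function name, the base-2 logarithm, and the direct empirical estimator are just my own choices for illustration):

```python
from collections import Counter
from math import log2

def plugin_mi(col_i, col_j):
    """Plug-in (empirical) mutual information between two aligned columns, in bits."""
    assert len(col_i) == len(col_j)
    n = len(col_i)
    count_i = Counter(col_i)                # marginal counts for column i
    count_j = Counter(col_j)                # marginal counts for column j
    count_ij = Counter(zip(col_i, col_j))   # joint counts of residue pairs
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_xy = c / n
        p_x = count_i[a] / n
        p_y = count_j[b] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi

print(plugin_mi("CLLYFFDTTQGILMIGCL", "IILLIIIIFLIVILVIFV"))  # nonzero
print(plugin_mi("GGGGGGGGGG", "GGGGGGGGGG"))  # 0.0: the only observed pair (G, G) has probability 1
```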

Best Answer

Mutual information $I(X, Y)$ can be thought of as a measure of the reduction in uncertainty about $X$ after observing $Y$:

$$ I(X, Y) = H(X) - H(X|Y)$$

where $H(X)$ is the entropy of $X$ and $H(X|Y)$ is the conditional entropy of $X$ given $Y$. By symmetry it follows that

$$ I(X, Y) = H(Y) - H(Y|X)$$

However, the mutual information of a variable with itself is equal to the entropy of that variable,

$$ I(X, X) = H(X)$$

and is called self-information. This holds because $H(X|Y) = 0$ whenever the values of $X$ are completely determined by $Y$, which is certainly the case for $H(X|X)$: entropy is a measure of uncertainty, and there is no uncertainty left about the values of $X$ given the values of $X$, so

$$ I(X, X) = H(X) - H(X|X) = H(X) - 0 = H(X) $$

This is immediately obvious if you think of it in terms of Venn diagrams as illustrated below.

[Venn diagram of $H(X)$, $H(Y)$, and their overlap $I(X, Y)$]
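As a quick numerical sanity check (a sketch only; the toy sample below is made up for illustration), you can verify $I(X, X) = H(X)$ with plug-in estimates:

```python
from collections import Counter
from math import log2

x = "AACBBBAC"  # an arbitrary toy sample of X
p = {a: c / len(x) for a, c in Counter(x).items()}  # empirical frequencies

# Entropy H(X)
h_x = -sum(px * log2(px) for px in p.values())

# Plug-in I(X, X): the joint distribution of (X, X) puts p(x) on each pair (x, x)
i_xx = sum(px * log2(px / (px * px)) for px in p.values())

print(h_x, i_xx)  # the two values coincide
```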

You can also show this using the formula for mutual information, substituting in the conditional entropy term, i.e.

$$ H(X|Y) = -\sum_{x \in X, y \in Y} p(x, y) \log \frac{p(x,y)}{p(y)} $$

by changing the $y$'s into $x$'s and recalling that $X \cap X = X$, so $p(x, x) = p(x)$. [Notice that this is an informal argument, since for continuous variables $p(x, x)$ would not have a density function, while still having a cumulative distribution function.]
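Carrying out that substitution explicitly (a short derivation, assuming discrete $X$ as above):

$$ H(X|X) = -\sum_{x \in X} p(x, x) \log \frac{p(x, x)}{p(x)} = -\sum_{x \in X} p(x) \log \frac{p(x)}{p(x)} = 0, $$

and therefore $I(X, X) = H(X) - H(X|X) = H(X) - 0 = H(X)$.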

So yes, if you know something about $X$, then learning again about $X$ gives you no more information.

See Chapter 2 of Elements of Information Theory by Cover and Thomas, or Shannon's original 1948 paper itself, to learn more.


As for your second question, it is a common problem that your data do not contain some values that could possibly occur. In this case the classical estimator of probability, i.e.

$$ \hat p = \frac{n_i}{\sum_i n_i} $$

where $n_i$ is the number of occurrences of the $i$th value (out of $d$ categories), gives you $\hat p = 0$ whenever $n_i = 0$. This is called the zero-frequency problem. The easy and commonly applied fix is, as your professor told you, to add some constant $\beta$ to your counts, so that

$$ \hat p = \frac{n_i + \beta}{(\sum_i n_i) + d\beta} $$

Common choices for $\beta$ are $1$, i.e. applying a uniform prior based on Laplace's rule of succession, $1/2$ for the Krichevsky–Trofimov estimate, or $1/d$ for the Schürmann–Grassberger (1996) estimator. Notice, however, that what you are doing here is bringing out-of-data (prior) information into your model, so it takes on a subjective, Bayesian flavor. When using this approach you have to remember the assumptions you made and take them into consideration.
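To make the smoothed estimator concrete, here is a minimal sketch (Python; the function name, the default $\beta$, and the 20-letter amino-acid alphabet are assumptions for illustration, not part of any particular package):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed standard 20-letter alphabet

def smoothed_probs(column, beta=1.0, alphabet=AMINO_ACIDS):
    """Pseudo-count estimate p_i = (n_i + beta) / (N + d*beta), over the whole alphabet."""
    counts = Counter(column)
    n_total = len(column)
    d = len(alphabet)
    return {a: (counts[a] + beta) / (n_total + d * beta) for a in alphabet}

# The all-G column no longer assigns probability 0 to unseen residues:
probs = smoothed_probs("GGGGGGGGGG", beta=1.0)
print(probs["G"])  # (10 + 1) / (10 + 20)
print(probs["A"])  # (0 + 1) / (10 + 20)
```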

This approach is commonly used, e.g. in the R entropy package. You can find some further information in the following paper:

Schürmann, T., and P. Grassberger (1996). Entropy estimation of symbol sequences. Chaos, 6, 414-427.
