[Math] Mutual information vs Information Gain

entropy, information-theory

I always thought that mutual information and information gain referred to the same thing; however, looking at Wikipedia:

http://en.wikipedia.org/wiki/Information_gain

https://en.wikipedia.org/wiki/Mutual_information

I see that information gain is something completely different and asymmetric.

What are the differences in practice? When should I choose one or the other?

Best Answer

We know that $H(X)$ quantifies the amount of information that each observation of $X$ provides or, equivalently, the minimum number of bits that we need to encode $X$ ($L_X \to H(X)$, where $L_X$ is the optimal average codelength; Shannon's first theorem, i.e. the source coding theorem).
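As a quick numeric illustration (a minimal sketch with a made-up distribution, not part of the original answer): assigning each symbol a Shannon code length $\lceil -\log_2 p(x) \rceil$ gives an average codelength $L_X$ with $H(X) \le L_X < H(X)+1$.

```python
import numpy as np

# Hypothetical source distribution over 4 symbols (made-up numbers).
p_x = np.array([0.5, 0.25, 0.15, 0.10])

# Entropy: the lower bound on the average codelength (bits/symbol).
H_X = -np.sum(p_x * np.log2(p_x))

# Shannon code lengths ceil(-log2 p(x)); they satisfy the Kraft inequality,
# so a prefix code with these lengths exists.
lengths = np.ceil(-np.log2(p_x))
L_X = np.sum(p_x * lengths)  # average codelength of that prefix code

print(f"H(X) = {H_X:.3f} bits, average codelength = {L_X:.3f} bits")
# H(X) <= L_X < H(X) + 1, consistent with Shannon's source coding theorem.
```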

The mutual information $$I(X;Y)=H(X) - H(X \mid Y)$$ measures the reduction in uncertainty (or the "information gained") about $X$ when $Y$ is known.
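Here is a minimal sketch of that definition on a small, made-up joint pmf (the numbers are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical joint pmf p(x, y) over a 2x3 alphabet (made-up numbers).
p_xy = np.array([[0.20, 0.15, 0.15],
                 [0.05, 0.25, 0.20]])

p_x = p_xy.sum(axis=1)     # marginal of X
p_y = p_xy.sum(axis=0)     # marginal of Y

H_X = -np.sum(p_x * np.log2(p_x))

# H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y)
p_x_given_y = p_xy / p_y   # column j holds p(x | y_j)
H_X_given_Y = -np.sum(p_xy * np.log2(p_x_given_y))

I_XY = H_X - H_X_given_Y
print(f"I(X;Y) = {I_XY:.4f} bits")
```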

It can be written as $$I(X;Y)=D(p_{X,Y}\mid \mid p_X \,p_Y)=D(p_{X\mid Y} \,p_Y \mid \mid p_X \,p_Y),$$ where $D(\cdot)$ is the Kullback–Leibler divergence (or distance, or relative entropy)... or information gain (this latter term is not used much in information theory, in my experience).

So, they are the same thing. Granted, $D(\cdot)$ is not symmetric in its arguments, but don't let that confuse you. We are not computing $D(p_X \mid \mid p_Y)$, but $D(p_{X,Y}\mid \mid p_X \,p_Y)$, and this is symmetric in $X$ and $Y$.
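Continuing the sketch above (same hypothetical `p_xy`), the KL form gives the same number as $H(X)-H(X\mid Y)$, and swapping the roles of $X$ and $Y$ (transposing the joint) leaves it unchanged:

```python
# I(X;Y) as the divergence D(p_{X,Y} || p_X p_Y).
product = np.outer(p_x, p_y)   # p_X(x) * p_Y(y)
D_joint_vs_product = np.sum(p_xy * np.log2(p_xy / product))

# Swapping X and Y just transposes the joint; the value is unchanged.
D_swapped = np.sum(p_xy.T * np.log2(p_xy.T / np.outer(p_y, p_x)))

print(D_joint_vs_product, D_swapped)   # both equal I(X;Y) computed above
```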

A slightly different situation arises when one is interested in the effect of knowing a particular value $Y=y$. In this case, because we are not averaging over $y$, the number of bits gained [*] would be $D(p_{X\mid Y=y} \mid \mid p_X )$... which depends on $y$.

[*] To be precise, that's actually the number of bits we waste when coding the conditioned source $X\mid Y=y$ as if we didn't know $Y$ (i.e., using the unconditioned distribution of $X$).
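Still continuing the same hypothetical sketch: the pointwise quantity $D(p_{X\mid Y=y} \mid\mid p_X)$ varies with $y$, and its average over $p(y)$ recovers $I(X;Y)$:

```python
# "Information gained" for each specific value y; depends on y.
for j, py in enumerate(p_y):
    d_y = np.sum(p_x_given_y[:, j] * np.log2(p_x_given_y[:, j] / p_x))
    print(f"D(p_X|Y=y{j} || p_X) = {d_y:.4f} bits")

# Averaging the pointwise divergences over p(y) gives back I(X;Y).
avg = sum(p_y[j] * np.sum(p_x_given_y[:, j] * np.log2(p_x_given_y[:, j] / p_x))
          for j in range(len(p_y)))
print(f"average over y = {avg:.4f} bits")
```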