Solved – Calculating Mutual Information

information theory · machine learning · mutual information

I am making a spam filter for an assignment. One of the steps outlined in a paper that I have read indicates finding the mutual information for all the words in the text and then choosing the 500 or so with the highest MI.

The equation given in the paper (and elsewhere) is:

$MI(X_i, C) = \sum{P(X_i, C) \log{\frac{P(X_i, C)}{P(X_i)\cdot P(C)}}}$

where $X_i$ is a feature and $C$ is a class.

It is not clear in the paper, but $X_i$ will be a word and $C$ will be either spam or not-spam.

Can I just be clear about the calculation I should be doing? I think it is as follows for the word "viagra":

$P(X_i, C)$ in this case will be either the frequency with which the word "viagra" appears in all of the spam emails divided by the number of words in the spam emails, or the same for the ham (non-spam) emails.

$P(X_i)$ is the probability of the word occurring at all in both spam and ham emails.

$P(C)$ is the prior for a class.

So basically we are doing two calculations (one for the spam and one for the ham) and adding them together.

Is this right, do you think?

My understanding is that mutual information is like finding the information gain for a feature. Is this right?

Hope this makes sense. Many thanks in advance.

Best Answer

To be a little more explicit, what you are really doing is performing a double sum where you are iterating over all possible states of feature $X_i$ as well as all possible states of $C$. In your particular instance $C$ happens to be binary; that is, it has only two states (spam and ham).

As for $X_i$, it's a little unclear, but I'm guessing it is also a binary feature that indicates either the presence or absence of word $i$. So there are really 4 terms in your summation: word $i$ is present in spam, word $i$ is absent in spam, word $i$ is present in ham, and word $i$ is absent in ham.
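Written out with explicit indices (my notation, not necessarily the paper's), that double sum is:

$MI(X_i, C) = \sum_{x \in \{0,1\}} \sum_{c \in \{\text{spam},\,\text{ham}\}} P(X_i = x, C = c) \log \frac{P(X_i = x, C = c)}{P(X_i = x)\, P(C = c)}$

where $x = 1$ means word $i$ is present in a document and $x = 0$ means it is absent.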

So the $P(X_i)$ parts are really:

  • P(word $i$ is present)
  • P(word $i$ is absent)

The $P(C)$ parts are really:

  • P(class is spam)
  • P(class is ham)

Then the $P(X_i, C)$ bits are really:

  • P(word $i$ is present and class is spam)
  • P(word $i$ is absent and class is spam)
  • P(word $i$ is present and class is ham)
  • P(word $i$ is absent and class is ham)

You can then plug these values into the summation accordingly.
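As a concrete sketch, here is how the four-term sum could be computed from document counts, assuming binary presence/absence features and maximum-likelihood (count-based) probability estimates. The function name and argument names are my own, not from the paper:

```python
import math

def mutual_information(n_present_spam, n_absent_spam,
                       n_present_ham, n_absent_ham):
    """Mutual information (in bits) between a binary word feature and
    the class, estimated from document counts.

    Each argument is the number of documents in that (word state, class)
    cell, e.g. n_present_spam = number of spam emails containing the word.
    """
    total = n_present_spam + n_absent_spam + n_present_ham + n_absent_ham
    mi = 0.0
    # The four (word state, class) combinations. For each we need the
    # joint count, the marginal count for the word state, and the
    # marginal count for the class.
    cells = [
        (n_present_spam, n_present_spam + n_present_ham, n_present_spam + n_absent_spam),
        (n_absent_spam,  n_absent_spam  + n_absent_ham,  n_present_spam + n_absent_spam),
        (n_present_ham,  n_present_spam + n_present_ham, n_present_ham  + n_absent_ham),
        (n_absent_ham,   n_absent_spam  + n_absent_ham,  n_present_ham  + n_absent_ham),
    ]
    for n_joint, n_x, n_c in cells:
        if n_joint == 0:
            continue  # lim p->0 of p*log(p/q) is 0, so the term vanishes
        p_joint = n_joint / total        # P(X_i = x, C = c)
        p_x = n_x / total                # P(X_i = x)
        p_c = n_c / total                # P(C = c)
        mi += p_joint * math.log2(p_joint / (p_x * p_c))
    return mi
```

For example, a word that appears in every spam email and no ham email gives the maximum of 1 bit (`mutual_information(50, 0, 0, 50)`), while a word equally likely in both classes gives 0 (`mutual_information(25, 25, 25, 25)`). In practice you would rank all words by this score and keep the top 500.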
