Solved – Calculating pointwise mutual information between two strings

computational-statistics, mathematical-statistics, mutual information, natural language, text mining

I have a dataset consisting of 5000 sentences. I need to calculate PMI between 3-grams and 5-grams in this dataset.

For example:

The 5-gram is: $x_1$$x_2$$x_3$$x_4$$x_5$

And the 3-gram is: $x_2$$x_3$$x_4$

How can I calculate PMI($x_2$$x_3$$x_4$, $x_1$$x_2$$x_3$$x_4$$x_5$) in this dataset? What is the exact formula?

As far as I know, in order to calculate PMI(y,z), it is needed to keep track of these counts in the dataset:

Count(y,z) –> The number of co-occurrences of y and z

Count(i,z) –> The number of occurrences of z

Count(y,i) –> The number of occurrences of y

N –> The sample size

The final formula is:

$PMI(y,z) = \frac{Count(y,z)N}{Count(i,z)Count(y,i)}$

In the context of my problem, these counts are listed as below:

$Count_1$($x_2$$x_3$$x_4$, $x_1$$x_2$$x_3$$x_4$$x_5$) : All
occurrences of exact 5-gram " $x_1$$x_2$$x_3$$x_4$$x_5$ "

$Count_2$(i, $x_1$$x_2$$x_3$$x_4$$x_5$) : All occurrences of exact
5-gram "$x_1$$x_2$$x_3$$x_4$$x_5$ " !!!!!

$Count_3$($x_2$$x_3$$x_4$, i) : All occurrences of 5-grams in the form
of " _ $x_2$$x_3$$x_4$ _ ", that is, the first and last words are
substituted with all the possible words in the dataset.

$N$ : All the 3-grams. !!!!!

My problem is with $Count_2$ and $N$. As you see, $Count_2$ is equal to $Count_1$. Is it sensible? And I'm not sure about the way I have counted $N$.

Best Answer

Let's start by taking a look at where your expression for PMI comes from. According to this article, for a pair of outcomes $x$ and $y$, $$PMI(x,y) = \log\left[\frac{p(x,y)}{p(x)p(y)}\right]$$ This says that, in order to calculate PMI properly, you need to somehow define a rule for associating the observation of your $n$-grams with a probability.
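As a quick illustration of the definition, here is a minimal sketch that evaluates PMI directly from three probabilities. The function name `pmi_from_probs` and the example probabilities are made up for illustration, and the natural logarithm is an assumption (the definition works with any base; base 2 gives the result in bits).

```python
import math


def pmi_from_probs(p_xy, p_x, p_y):
    """PMI = log[ p(x,y) / (p(x) p(y)) ], using the natural log (an assumption)."""
    return math.log(p_xy / (p_x * p_y))


# Hypothetical probabilities, chosen only to show the three possible signs.
positive = pmi_from_probs(0.01, 0.02, 0.05)    # x and y co-occur more often than chance
independent = pmi_from_probs(0.06, 0.2, 0.3)   # p(x,y) = p(x) p(y), so PMI is about 0
negative = pmi_from_probs(0.001, 0.1, 0.1)     # x and y co-occur less often than chance

print(positive, independent, negative)
```

A positive value means the two outcomes occur together more often than independence would predict, a value near zero means they look independent, and a negative value means they tend to avoid each other.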

In the context of your particular data set, which can be cleanly divided into 5000 sentences, a very natural thing to define would be the probabilities for various $n$-grams to appear in a single sentence. To calculate the PMI, we can start by defining two different outcomes:

  • Outcome 1: the 3-gram $\vec{x} = (x_{1}, x_{2}, x_{3})$ appears in a given sentence
  • Outcome 2: the 5-gram $\vec{y} = (y_{1}, y_{2}, y_{3}, y_{4}, y_{5})$ appears in a given sentence

In general, the 3-gram and 5-gram need not contain any words in common in order to calculate a valid PMI; however, if you wish to set $x_{1} = y_{2}$, $x_{2} = y_{3}$, etc., you may certainly do so and still calculate a mathematically well-defined result.

To obtain $p\left(\vec{x}, \vec{y}\right)$, the joint probability that the 3-gram and the 5-gram both appear in the same sentence, we would simply count the number of sentences $n\left(\vec{x},\vec{y}\right)$ in your data set which contain both $\vec{x}$ and $\vec{y}$, and divide by the total number of sentences $N=5000$; i.e., $$p\left(\vec{x}, \vec{y}\right) = \frac{n\left(\vec{x},\vec{y}\right)}{N}$$ Similarly, $p\left(\vec{x}\right)$ and $p\left(\vec{y}\right)$ are obtained from the numbers of sentences $n\left(\vec{x}\right)$ and $n\left(\vec{y}\right)$, out of the 5000, in which $\vec{x}$ and $\vec{y}$ individually appear, each divided by $N$. Thus, the PMI is defined by $$PMI\left(\vec{x},\vec{y}\right) = \log\left\{\frac{\left[\frac{n\left(\vec{x},\vec{y}\right)}{N}\right]}{\left[\frac{n\left(\vec{x}\right)}{N}\right]\left[\frac{n\left(\vec{y}\right)}{N}\right]}\right\} = \log\left[\frac{n\left(\vec{x},\vec{y}\right)N}{n\left(\vec{x}\right)n\left(\vec{y}\right)}\right]$$ which looks superficially similar to your result, with a few key differences (e.g., you forgot to include the $\log$ function). This formula is valid regardless of whether or not the 3-gram $\vec{x}$ and the 5-gram $\vec{y}$ have any words in common, and it is defined with the understanding that:

  • $N$ is the number of sentences in the data set (5000, in this case)
  • The values $n\left(\vec{x}, \vec{y}\right)$, $n\left(\vec{x}\right)$, and $n\left(\vec{y}\right)$ count sentences: $n\left(\vec{x}\right)$ and $n\left(\vec{y}\right)$ are the numbers of sentences containing $\vec{x}$ and $\vec{y}$ respectively, and $n\left(\vec{x}, \vec{y}\right)$ is the number of sentences containing both. An $n$-gram counts as appearing only when all of its words occur together in the same sentence (i.e., if 2 of the words in a 5-gram appear in one sentence and 3 of the words appear in the next, it doesn't count as a valid 5-gram because the words aren't found all together within the same sentence)
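The sentence-level counting described above can be sketched in a few lines of Python. Everything here is illustrative rather than prescribed by the answer: the helper names `contains_ngram` and `sentence_pmi`, the whitespace tokenization, the natural logarithm, and the five-sentence toy corpus (standing in for the real 5000 sentences) are all assumptions.

```python
import math


def contains_ngram(tokens, ngram):
    """True if `ngram` occurs as a contiguous run of words inside `tokens`."""
    n = len(ngram)
    return any(tuple(tokens[i:i + n]) == tuple(ngram)
               for i in range(len(tokens) - n + 1))


def sentence_pmi(sentences, x, y):
    """PMI of n-grams x and y, where each sentence is one observation."""
    N = len(sentences)
    n_x = n_y = n_xy = 0
    for sentence in sentences:
        tokens = sentence.split()  # naive whitespace tokenization (an assumption)
        has_x = contains_ngram(tokens, x)
        has_y = contains_ngram(tokens, y)
        n_x += has_x
        n_y += has_y
        n_xy += has_x and has_y
    if n_x == 0 or n_y == 0:
        raise ValueError("PMI is undefined when an n-gram never occurs")
    if n_xy == 0:
        return float("-inf")  # log(0): the n-grams never share a sentence
    return math.log(n_xy * N / (n_x * n_y))


# Toy corpus in which the 3-gram and the 5-gram share no words.
corpus = [
    "i drink a cup of tea in the morning",
    "i drink a cup of coffee in the morning",
    "we walk in the morning",
    "i drink a cup of water at night",
    "they sleep at night",
]
three_gram = ("in", "the", "morning")
five_gram = ("i", "drink", "a", "cup", "of")

result = sentence_pmi(corpus, three_gram, five_gram)
print(result)  # equals log(2 * 5 / (3 * 3)) = log(10/9) for this corpus
```

Here $N = 5$, $n(\vec{x}) = 3$, $n(\vec{y}) = 3$, and $n(\vec{x},\vec{y}) = 2$, so the two n-grams share a sentence slightly more often than independence would predict. Counting each sentence at most once, even if an n-gram repeats inside it, is what makes the counts consistent with per-sentence probabilities.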

In the question as you originally stated it, you considered a very special and unusual case: one where every word in the 3-gram was identical to a word in the 5-gram; i.e., $x_{1}=y_{2}$, $x_{2}=y_{3}$, $x_{3}=y_{4}$. In this unique circumstance, as you correctly observed, $n\left(\vec{x},\vec{y}\right) = n\left(\vec{y}\right)$, i.e., the number of sentences in which the 3-gram and 5-gram appear jointly is the same as the number of sentences in which the 5-gram appears at all, because any sentence containing the 5-gram necessarily contains the 3-gram. Thus, in this special case, those two terms cancel, and we are left with $$PMI\left(\vec{x},\vec{y}\right) = \log\left[\frac{N}{n\left(\vec{x}\right)}\right]$$ This is a perfectly valid result, assuming that you are trying to calculate the PMI for this special case. However, it's not a very general result; usually, you'd be considering cases where the words of the 3-gram and 5-gram don't overlap, and thus those count values would not cancel.
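The cancellation in the nested case can be checked numerically. The following is a small sketch under the same assumptions as before (whitespace tokenization, natural logarithm, a made-up four-sentence corpus, and a hypothetical helper `has_ngram`); it takes the 3-gram to be the middle three words of the 5-gram and compares the general formula with the simplified one.

```python
import math


def has_ngram(tokens, ngram):
    """True if `ngram` occurs as a contiguous run of words inside `tokens`."""
    n = len(ngram)
    return any(tuple(tokens[i:i + n]) == tuple(ngram)
               for i in range(len(tokens) - n + 1))


# Made-up corpus for illustration only.
sentences = [
    "the quick brown fox jumps",
    "a quick brown fox sleeps",
    "the lazy dog sleeps",
    "the quick brown fox jumps again",
]
five = ("the", "quick", "brown", "fox", "jumps")
three = five[1:4]  # the nested 3-gram: ("quick", "brown", "fox")

tokenized = [s.split() for s in sentences]
N = len(tokenized)
n_x = sum(has_ngram(t, three) for t in tokenized)
n_y = sum(has_ngram(t, five) for t in tokenized)
n_xy = sum(has_ngram(t, three) and has_ngram(t, five) for t in tokenized)

pmi_general = math.log(n_xy * N / (n_x * n_y))  # full formula
pmi_special = math.log(N / n_x)                 # after n_xy and n_y cancel

print(n_xy == n_y, math.isclose(pmi_general, pmi_special))
```

In this corpus $n(\vec{x},\vec{y}) = n(\vec{y}) = 2$ and $n(\vec{x}) = 3$, so both expressions reduce to $\log(4/3)$. The 5-gram's own frequency drops out entirely, which is why the nested case says little about the 5-gram itself.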
