Solved – What are the pros and cons of applying pointwise mutual information on a word cooccurrence matrix before SVD

language-models, mutual-information, natural-language, svd, word-embeddings

One way to generate word embeddings is as follows:

  1. Get a corpus, e.g. "I enjoy flying. I like NLP. I like deep learning."
  2. Build the word cooccurrence matrix from it:

[Image: word co-occurrence matrix $X$ built from the corpus]

  3. Perform SVD on $X$, and keep the first $k$ columns of $U$.

[Image: SVD of the co-occurrence matrix, $X = U S V^{\top}$]

Each row of the submatrix $U_{1:|V|,1:k}$ will be the word embedding of the word that the row represents (row 1 = "I", row 2 = "like", …).
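
A minimal sketch of this pipeline in Python (assuming a symmetric window of one token and treating "." as a token; the exact counts and row ordering may differ from the figures above):

```python
# Minimal sketch: co-occurrence counts + SVD (window of 1, '.' kept as a token).
import numpy as np

corpus = ["I", "enjoy", "flying", ".", "I", "like", "NLP", ".",
          "I", "like", "deep", "learning", "."]
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Build the symmetric co-occurrence matrix X with a +/-1 token window.
X = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            X[idx[w], idx[corpus[j]]] += 1

# SVD on X; keep the first k columns of U as the embeddings.
U, S, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k]  # row r is the embedding of vocab[r]
for word, vec in zip(vocab, embeddings.round(2)):
    print(word, vec)
```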

Between steps 2 and 3, pointwise mutual information is sometimes applied (e.g. A. Herbelot and E.M. Vecchi. 2015. Building a shared world: Mapping distributional to model-theoretic semantic spaces. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal.).

What are the pros and cons of applying pointwise mutual information on a word cooccurrence matrix before SVD?

Best Answer

According to Dan Jurafsky and James H. Martin's book (Speech and Language Processing):

"It turns out, however, that simple frequency isn’t the best measure of association between words. One problem is that raw frequency is very skewed and not very discriminative. If we want to know what kinds of contexts are shared by apricot and pineapple but not by digital and information, we’re not going to get good discrimination from words like the, it, or they, which occur frequently with all sorts of words and aren’t informative about any particular word."

Sometimes we replace the raw frequency with positive pointwise mutual information (PPMI):

$$\text{PPMI}(w,c) = \max{\left(\log_{2}{\frac{P(w,c)}{P(w)P(c)}},0\right)}$$
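
As a sketch, this transform could be applied to a raw count matrix as below (assuming rows index target words $w$ and columns index context words $c$; the small `eps` only guards against $\log 0$):

```python
import numpy as np

def ppmi(X, eps=1e-12):
    """Positive PMI transform of a raw co-occurrence count matrix X."""
    total = X.sum()
    p_wc = X / total                       # joint P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)  # marginal P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)  # marginal P(c)
    pmi = np.log2((p_wc + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0)              # clip negative PMI to zero

# Applying PMI "between steps 2 and 3" then just means running the SVD
# on ppmi(X) instead of on the raw counts X.
```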

PMI on its own measures how much more often a word $w$ is observed with a context word $c$ than we would expect if they occurred independently. In PPMI we keep only the positive values of PMI. Let's think about when PMI is positive or negative, and why we discard the negative values:

What does positive PMI mean?

  • $\frac{P(w,c)}{P(w)P(c)} > 1$

  • $P(w,c) > P(w)P(c)$

  • This happens when $w$ and $c$ co-occur more often than we would expect by chance, like kick and ball. We'd like to keep these!

What does negative PMI mean?

  • $\frac{P(w,c)}{P(w)P(c)} < 1$

  • $P(w,c) < P(w)P(c)$

  • This means $w$ and $c$ co-occur less often than they occur individually. It might indicate unreliable statistics due to limited data; otherwise it reflects an uninformative co-occurrence, e.g. 'the' and 'ball' ('the' occurs with most words anyway).
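
To make this concrete, here is a toy calculation with made-up probabilities (illustrative numbers, not corpus estimates): suppose $P(\textit{kick}) = 0.001$, $P(\textit{ball}) = 0.002$, $P(\textit{kick},\textit{ball}) = 0.0001$, $P(\textit{the}) = 0.05$, and $P(\textit{the},\textit{ball}) = 0.00005$. Then

$$\text{PMI}(\textit{kick},\textit{ball}) = \log_2\frac{0.0001}{0.001 \times 0.002} = \log_2 50 \approx 5.6 > 0,$$

$$\text{PMI}(\textit{the},\textit{ball}) = \log_2\frac{0.00005}{0.05 \times 0.002} = \log_2 0.5 = -1 < 0 \;\Rightarrow\; \text{PPMI}(\textit{the},\textit{ball}) = 0.$$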

PMI, and particularly PPMI, helps us keep the informative co-occurrences and discard the uninformative ones before the SVD step.