I came across some papers on statistical methods in natural language processing, in particular Ted Dunning's paper, where I found the formula he uses to calculate the log likelihood ratio. It seemed remarkably similar to the formula for pointwise mutual information. What is the principal difference between these two measures?
Solved – Difference between pointwise mutual information and log likelihood ratio
likelihood-ratio, mutual-information, natural-language
Related Solutions
With my limited knowledge, I think:
- "the probability of observing w in input" requires a distribution in order to compute the value
- "the probability of observing w in both the input and in the background corpus assuming equal probabilities in both corpora" means "the likelihood of observing w ... given that the probability for w is equal in both corpora".
Here's my formulation of it. Stating the problem a little more formally:
- Hypothesis 1: P(w in input) = P(w in background) = p
- Hypothesis 2: P(w in input) = p1 and P(w in background) = p2 and p1 $\ne$ p2
The critical part is that you will need to assume a distribution here. Simplistically, we assume a binomial distribution for generating w in a text. Given the sample data, we can use maximum likelihood estimation to compute the values for p, p1, and p2, and here they are:
- p = (count-of-w-in-input + count-of-w-in-background) / (input-size + background-size) = (c1 + c2) / (N1 + N2)
- p1 = c1 / N1
- p2 = c2 / N2
We want to know which hypothesis is more likely. Therefore, we compute the likelihood of each hypothesis and compare to each other (which is basically what the likelihood ratio does).
Since we assume binomial distribution, we can compute the likelihood of having c1 and c2.
For Hypothesis 1:
L(c1) = The probability of observing w in the input = the likelihood of obtaining the count c1 when there are N1 words, assuming the probability p (or, in other words, selecting w c1 times out of N1 trials), which is b(N1, c1, p) -- please see the binomial probability formula here
L(c2) = The probability of observing w in the background = the likelihood of obtaining the count c2 when there are N2 words, assuming the probability p, which is b(N2, c2, p)
For Hypothesis 2, we can use p1 and p2 instead.
Now we want to know which hypothesis is more likely; we will need to somehow compare an output value from each hypothesis.
But each hypothesis has 2 values, L(c1) and L(c2). How can we compare which hypothesis is more likely? --- We multiply them together to obtain a single-valued output: the two counts are generated independently, so their product is the joint likelihood of observing both under that hypothesis.
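To make this concrete, here is a rough Python sketch of the test described above (the function names are my own, and it assumes every count is strictly between 0 and its sample size so that all the logarithms are defined):

```python
import math

def log_binom_lik(p, k, n):
    # Log of the binomial likelihood kernel: k*log(p) + (n-k)*log(1-p).
    # The binomial coefficient is omitted because it cancels in the ratio.
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_likelihood_ratio(c1, n1, c2, n2):
    # Maximum likelihood estimates under each hypothesis
    p = (c1 + c2) / (n1 + n2)  # Hypothesis 1: one shared probability
    p1 = c1 / n1               # Hypothesis 2: separate probabilities
    p2 = c2 / n2
    # Log of L(Hypothesis 2) / L(Hypothesis 1); multiplying by 2 gives
    # Dunning's -2 log(lambda) statistic
    return (log_binom_lik(p1, c1, n1) + log_binom_lik(p2, c2, n2)
            - log_binom_lik(p, c1, n1) - log_binom_lik(p, c2, n2))
```

When w occurs at identical rates in both corpora the ratio is zero; the more the rates differ, the larger it grows.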
Let's start by taking a look at where your expression for PMI comes from. According to this article, for a pair of outcomes $x$ and $y$, $$PMI(x,y) = \log\left[\frac{p(x,y)}{p(x)p(y)}\right]$$ This says that, in order to calculate PMI properly, you need to somehow define a rule for associating the observation of your $n$-grams with a probability.
In the context of your particular data set, which can be cleanly divided into 5000 sentences, a very natural thing to define would be the probabilities for various $n$-grams to appear in a single sentence. To calculate the PMI, we can start by defining two different outcomes:
- Outcome 1: the 3-gram $\vec{x} = (x_{1}, x_{2}, x_{3})$ appears in a given sentence
- Outcome 2: the 5-gram $\vec{y} = (y_{1}, y_{2}, y_{3}, y_{4}, y_{5})$ appears in a given sentence
In general, the 3-gram and 5-gram need not contain any words in common in order to calculate a valid PMI, however if you wish to set $x_{1} = y_{2}$, $x_{2} = y_{3}$, etc., you may certainly do so and still calculate a mathematically well-defined result.
To obtain $p\left(\vec{x}, \vec{y}\right)$, the joint probability that both the 3-gram and the 5-gram appear simultaneously in the same sentence, we would simply count the number of sentences $n\left(\vec{x},\vec{y}\right)$ in your data set which contain both $\vec{x}$ and $\vec{y}$ together, and divide by the total number of sentences $N=5000$; i.e., $$p\left(\vec{x}, \vec{y}\right) = \frac{n\left(\vec{x},\vec{y}\right)}{N}$$ Similarly, $p\left(\vec{x}\right)$ and $p\left(\vec{y}\right)$ are defined by the number of times that $\vec{x}$ and $\vec{y}$ are observed individually among the $N=5000$ sentences. Thus, the PMI is defined by $$PMI\left(\vec{x},\vec{y}\right) = \log\left\{\frac{\left[\frac{n\left(\vec{x},\vec{y}\right)}{N}\right]}{\left[\frac{n\left(\vec{x}\right)}{N}\right]\left[\frac{n\left(\vec{y}\right)}{N}\right]}\right\} = \log\left[\frac{n\left(\vec{x},\vec{y}\right)N}{n\left(\vec{x}\right)n\left(\vec{y}\right)}\right]$$ which looks superficially similar to your result, with a few key differences (e.g., you forgot to include the $\log$ function, etc.). This formula is valid regardless of whether or not the 3-gram $\vec{x}$ and the 5-gram $\vec{y}$ have any words in common, and it is defined with the understanding that:
- $N$ is the number of sentences in the data set (5000, in this case)
- The values $n\left(\vec{x}, \vec{y}\right)$, $n\left(\vec{x}\right)$, and $n\left(\vec{y}\right)$ count the number of times that $\vec{x}$ and $\vec{y}$ appear with all words simultaneously together in the same sentence (i.e., if 2 of the words in a 5-gram appear in one sentence and 3 of the words appear in the next, it doesn't actually count as a valid 5-gram because the words aren't found all together within the same sentence)
In the question as you originally stated it, you considered a very special and unusual case: one where every word in the 3-gram was identical to a word in the 5-gram; i.e., $x_{1}=y_{2}$, $x_{2}=y_{3}$, $x_{3}=y_{4}$. In this unique circumstance, as you correctly observed, $n\left(\vec{x},\vec{y}\right) = n\left(\vec{y}\right)$, i.e., the number of sentences in which the 3-gram and 5-gram appear jointly is the same as the number of sentences in which the 5-gram appears in total. Thus, in this special case, those two terms cancel, and we are left with $$PMI\left(\vec{x},\vec{y}\right) = \log\left[\frac{N}{n\left(\vec{x}\right)}\right]$$ This is a perfectly valid result, assuming that you are trying to calculate the PMI for this special case. However, it's not a very general result; usually, you'd be considering cases where the words of the 3-gram and 5-gram don't overlap, and thus those count values would not cancel.
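As a minimal sketch (the function name and parameters are my own), the count-based PMI above is a one-liner; in the nested special case, where $n(\vec{x},\vec{y}) = n(\vec{y})$, it reduces to $\log[N / n(\vec{x})]$ as derived above:

```python
import math

def pmi(n_xy, n_x, n_y, N):
    # PMI(x, y) = log[ p(x, y) / (p(x) p(y)) ] from sentence-level counts:
    # n_xy sentences contain both n-grams together, n_x contain the 3-gram,
    # n_y contain the 5-gram, and N is the total number of sentences.
    return math.log((n_xy * N) / (n_x * n_y))
```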
Related Question
- CART, Likelihood-Ratio, Information-Theory – Understanding the Relationship Between GINI Score and Log-Likelihood Ratio
- Solved – Information gain and mutual information: different or equal
- Solved – What are the pros and cons of applying pointwise mutual information on a word cooccurrence matrix before SVD
Best Answer
They're very closely related. "Log-likelihood" is not a very useful name, because it sounds like it's going to be a log probability. But it's not that at all. "Log likelihood ratio" as used at times in the original paper is more correct, and that immediately hints that we should expect to see this metric as a comparison (ratio) between two different probabilities. Furthermore, the formula is structured as a weighted average of logarithms, just like the basic formula for entropy ($-\sum_i p_i \log(p_i)$). These all strongly hint that this metric is closely related to an information-theoretic metric like mutual information.
Here's the formula for log likelihood given $p_n$, $k_n$, and $n_n$ as it appears in the paper:
$$ \mathrm{logL}(p_n, n_n, k_n) = k_n \log(p_n) + (n_n - k_n) \log (1 - p_n) $$
As the paper itself notes, $p$ is just $k / n$, so we need only normalize by $n$ to get the entropy formula for a Bernoulli distribution (coin flips with a weighted coin, with probability $p$ of turning up heads):
$$ \begin{align} \mathrm{BernoulliEntropy}(p, n, k) & = \frac{k}{n} \log(p) + \frac{(n - k)}{n} \log (1 - p) \\ \mathrm{BernoulliEntropy}(p) & = p \log(p) + (1 - p)\log(1 - p) \end{align} $$
This normalization is the only real difference, and I find it a little puzzling that the author didn't adopt this simplified approach.
The other equation we need (using this new $n$-normalized formulation) is the formula for cross-entropy. It is almost identical, but it compares two different probabilities: it gives us a way to measure the "cost" of representing one probability distribution with another. (Concretely, suppose you compress data from one distribution with an encoding optimized for another distribution; this tells you how many bits it will take on average.)
$$ \mathrm{CrossEntropy}(p,p') = p \log(p') + (1 - p) \log (1 - p') $$
Note that if $p$ and $p'$ are the same, then this winds up being identical to Bernoulli entropy.
$$ \mathrm{BernoulliEntropy}(p) = \mathrm{CrossEntropy}(p, p) $$
To put these formulas together, we just have to specify the exact comparison we want to make. Suppose we have data from two different coin-flipping sessions, and we want to know whether the same coin was used in both sessions, or whether there were two different coins with different weights.
We propose a null hypothesis: the coin used was the same. To test that hypothesis, we need to perform a total of eight weighted log calculations. Four of them will be simple entropy calculations, while the other four will be cross-entropy calculations; the difference will be in the probabilistic model we use. We will use the data for each session to calculate two different probabilities: $p_{1h}$ and $p_{2h}$. Then we will use all the data to calculate just one combined probability $p_{ch}$. The cross-entropy calculations will compare session probabilities to combined probabilities; the entropy calculations will compare session probabilities to themselves. Under the null hypothesis, the values will be the same, and will cancel each other out.
Recall that the logarithm of a probability is always negative; for clarity later, the formulas below are negated so that they give positive values.
Cross entropy is always equal to or greater than entropy, so we want to subtract entropy from cross entropy to get a meaningful value. If they are close to equal, then the result will be zero, and we accept the null hypothesis. If not, then the value is guaranteed to be positive, and for larger and larger values, we will be more and more inclined to reject the null hypothesis.
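Both facts — that $\mathrm{CrossEntropy}(p, p) = \mathrm{BernoulliEntropy}(p)$, and that, once negated, cross entropy is never smaller than entropy — are easy to check numerically. A sketch with my own function names, using the un-negated formulas from the text:

```python
import math

def bernoulli_entropy(p):
    # As written in the text (un-negated, so the value is negative)
    return p * math.log(p) + (1 - p) * math.log(1 - p)

def cross_entropy(p, q):
    # Cost of encoding a coin with true bias p using a code built for bias q
    return p * math.log(q) + (1 - p) * math.log(1 - q)
```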
Recall that a difference of logarithms is the logarithm of a ratio, so subtracting the first entropy value from the first cross entropy value gives
$$ \begin{align} & -p_{1h} \log(p_{ch}) - -p_{1h} \log(p_{1h}) \\ =\ & p_{1h} \log(p_{1h}) - p_{1h} \log(p_{ch}) \\ =\ & p_{1h} \log\frac{p_{1h}}{p_{ch}} \\ \end{align} $$
Combining all the terms and logarithms together, we get this definition of the combined, normalized log likelihood ratio, $ \mathrm{logL'} $:
$$ \begin{align} \mathrm{logL'} &=\ p_{1h}\ \log\frac{p_{1h}}{p_{ch}} \\ &+ (1 - p_{1h})\ \log\frac{(1 - p_{1h})}{(1 - p_{ch})} \\ &+ p_{2h}\ \log\frac{p_{2h}}{p_{ch}} \\ &+ (1 - p_{2h})\ \log\frac{(1 - p_{2h})}{(1 - p_{ch})} \\ \end{align} $$
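A direct transcription of this formula into Python (the names are hypothetical; it assumes both session frequencies are strictly between 0 and 1 so every logarithm is defined):

```python
import math

def logL_prime(k1, n1, k2, n2):
    # k = heads, n = flips for each session; pch pools both sessions
    p1h = k1 / n1
    p2h = k2 / n2
    pch = (k1 + k2) / (n1 + n2)
    return (p1h * math.log(p1h / pch)
            + (1 - p1h) * math.log((1 - p1h) / (1 - pch))
            + p2h * math.log(p2h / pch)
            + (1 - p2h) * math.log((1 - p2h) / (1 - pch)))
```

With identical session frequencies it returns zero (accept the null hypothesis); otherwise it is strictly positive.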
At this point, converting this to PMI is just a matter of reinterpreting the notation. $p_{1h}$ is the probability of turning up heads in the first session, and so we could also call it the conditional probability of heads given that we're looking only at data from the first session. We can do similar things to the other probabilities from each session:
$$ \begin{align} p_{1h} &= \mathrm{P}(h|c_1) \\ (1 - p_{1h}) &= \mathrm{P}(t|c_1) \\ p_{2h} &= \mathrm{P}(h|c_2) \\ (1 - p_{2h}) &= \mathrm{P}(t|c_2) \end{align} $$
The null hypothesis probabilities are not conditional but "prior" probabilities -- probabilities calculated without taking into account additional information about the sessions:
$$ \begin{align} p_{ch} &= \mathrm{P}(h) \\ 1-p_{ch} &= \mathrm{P}(t) \\ \end{align} $$
Now, by the definition of conditional probability we have
$$ \begin{align} \frac{\mathrm{P}(h,c)}{\mathrm{P}(c)} &= \mathrm{P}(h|c) \\ \frac{\mathrm{P}(h,c)}{\mathrm{P}(c)\mathrm{P}(h)} &= \frac{\mathrm{P}(h|c)}{\mathrm{P}(h)} \end{align} $$
And we see that by converting the $\mathrm{logL'}$ formulas to use conditional probability notation, we have (for example)
$$ \begin{align} & \mathrm{P}(h|c_1)\ \log\frac{\mathrm{P}(h|c_1)}{\mathrm{P}(h)} \\ \end{align} $$
This is exactly the (pointwise) formula for $\mathrm{D_{KL}}((h|c_1)\ ||\ h)$, the KL divergence of the conditional distribution of heads in session one from the prior distribution of heads. So the log likelihood, when normalized by the number of trials in each session, is the same as the sum, for each possible outcome, of KL divergences of conditional distributions from prior distributions. If you understand KL divergence, this provides a good intuition for how this test works: it measures the "distance" between the conditional and unconditional probabilities for each outcome. If the difference is large, then the null hypothesis is probably false.
The relationship between mutual information and KL divergence is well known. So we're nearly done. Starting from the above formula, we have
$$ \begin{align} & \mathrm{P}(h|c_1)\ \log\frac{\mathrm{P}(h|c_1)}{\mathrm{P}(h)} \\ =\ & \mathrm{P}(h|c_1)\ \log\frac{\mathrm{P}(h,c_1)}{\mathrm{P}(c_1)\mathrm{P}(h)} \\ =\ & \mathrm{P}(h|c_1) \cdot \mathrm{PMI}(h; c_1) \end{align} $$
Where the last version is based on the definition of Pointwise Mutual Information (as given here). Putting it all together:
$$ \begin{align} \mathrm{logL'} &= \mathrm{P}(h|c_1) \cdot \mathrm{PMI}(h; c_1) \\ &+ \mathrm{P}(t|c_1) \cdot \mathrm{PMI}(t; c_1) \\ &+ \mathrm{P}(h|c_2) \cdot \mathrm{PMI}(h; c_2) \\ &+ \mathrm{P}(t|c_2) \cdot \mathrm{PMI}(t; c_2) \\ \end{align} $$
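As a numerical sanity check on this decomposition (names are mine), the same quantity can be rebuilt from joint and marginal probabilities alone:

```python
import math

def logL_prime_via_pmi(k1, n1, k2, n2):
    # Sum of P(outcome | session) * PMI(outcome; session) over all four
    # cells; assumes every cell count is nonzero so the logs are defined
    N = n1 + n2
    joint = {('h', 1): k1 / N, ('t', 1): (n1 - k1) / N,
             ('h', 2): k2 / N, ('t', 2): (n2 - k2) / N}
    p_session = {1: n1 / N, 2: n2 / N}
    p_outcome = {'h': (k1 + k2) / N, 't': (N - k1 - k2) / N}
    total = 0.0
    for (o, s), p_os in joint.items():
        pmi = math.log(p_os / (p_outcome[o] * p_session[s]))
        total += (p_os / p_session[s]) * pmi   # P(o | s) * PMI(o; s)
    return total
```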
We could recover the pre-normalized version by using total counts from the first and second sessions: multiplying $\mathrm{P}(h|c_n)$ by the number of trials in session $n$ gives the number of heads in session $n$, which recovers the original definition of $\mathrm{logL}$ at the top of this answer.
Dividing that number by the total number of trials would give $\mathrm{P}(h,c_n)$, converting this formula into the formula for mutual information, the weighted sum of PMI values for each outcome. So the difference between "log likelihood" and mutual information (pointwise or otherwise) is just a matter of normalization scheme.
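And to illustrate that final step (names mine): weighting each PMI term by the joint probability $\mathrm{P}(o, c_n)$ instead of the conditional probability yields ordinary mutual information.

```python
import math

def mutual_information(k1, n1, k2, n2):
    # MI(outcome; session) = sum over cells of P(o, s) * PMI(o; s);
    # zero-probability cells are skipped, since 0 * log(0) is taken as 0
    N = n1 + n2
    joint = {('h', 1): k1 / N, ('t', 1): (n1 - k1) / N,
             ('h', 2): k2 / N, ('t', 2): (n2 - k2) / N}
    p_session = {1: n1 / N, 2: n2 / N}
    p_outcome = {'h': (k1 + k2) / N, 't': (N - k1 - k2) / N}
    return sum(p * math.log(p / (p_outcome[o] * p_session[s]))
               for (o, s), p in joint.items() if p > 0)
```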