Solved – Difference between pointwise mutual information and log likelihood ratio

likelihood-ratio, mutual-information, natural-language

I came across some papers on statistical methods in natural language processing, particularly Ted Dunning's paper, and there I found the formula he uses to calculate the Log Likelihood Ratio, which seemed awfully similar to the formula for Pointwise Mutual Information. What is the principal difference between these two measures?

Best Answer

They're very closely related. "Log-likelihood" is not a very useful name, because it sounds like it's going to be a log probability, but it's not that at all. "Log likelihood ratio," as the original paper sometimes calls it, is more accurate, and it immediately hints that we should expect to see this metric as a comparison (ratio) between two different probabilities. Furthermore, the formula is structured as a weighted sum of logarithms, just like the basic formula for entropy ($-\sum_i p_i \log(p_i)$). All of this strongly hints that this metric is closely related to an information-theoretic quantity like mutual information.

Here's the formula for log likelihood, given $p_n$, $n_n$, and $k_n$, as it appears in the paper:

$$ \mathrm{logL}(p_n, n_n, k_n) = k_n \log(p_n) + (n_n - k_n) \log (1 - p_n) $$

As the paper itself notes, $p$ is just $k / n$, so we need only normalize by $n$ to get the entropy formula for a Bernoulli distribution (coin flips with a weighted coin, with probability $p$ of turning up heads):

$$ \begin{align} \mathrm{BernoulliEntropy}(p, n, k) & = \frac{k}{n} \log(p) + \frac{(n - k)}{n} \log (1 - p) \\ \mathrm{BernoulliEntropy}(p) & = p \log(p) + (1 - p)\log(1 - p) \end{align} $$
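
To make the correspondence concrete, here's a minimal Python sketch (the counts are made up, and the function names are mine, not the paper's) evaluating the paper's $\mathrm{logL}$ and the $n$-normalized Bernoulli form:

```python
import math

def logL(p, n, k):
    """Dunning's logL(p, n, k) = k*log(p) + (n-k)*log(1-p)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def bernoulli_entropy(p):
    """The n-normalized form: p*log(p) + (1-p)*log(1-p) (un-negated, as above)."""
    return p * math.log(p) + (1 - p) * math.log(1 - p)

# Hypothetical counts: 30 heads out of 100 flips.
k, n = 30, 100
p = k / n
print(logL(p, n, k) / n)       # ≈ -0.6109 (natural log)
print(bernoulli_entropy(p))    # same value
```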

This normalization is the only real difference, and I find it a little puzzling that the author didn't adopt this simplified approach.

The other equation we need (using this new $n$-normalized formulation) is the formula for cross-entropy. It is almost identical, but it compares two different probabilities: it gives us a way to measure the "cost" of representing one probability distribution with another. (Concretely, suppose you compress data from one distribution with an encoding optimized for another distribution; this tells you how many bits it will take on average.)

$$ \mathrm{CrossEntropy}(p,p') = p \log(p') + (1 - p) \log (1 - p') $$

Note that if $p$ and $p'$ are the same, then this winds up being identical to Bernoulli entropy.

$$ \mathrm{BernoulliEntropy}(p) = \mathrm{CrossEntropy}(p, p) $$
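
A quick numerical check of that identity, again just a sketch with arbitrary values:

```python
import math

def cross_entropy(p, p_prime):
    """p*log(p') + (1-p)*log(1-p'), un-negated to match the formulas above."""
    return p * math.log(p_prime) + (1 - p) * math.log(1 - p_prime)

def bernoulli_entropy(p):
    return p * math.log(p) + (1 - p) * math.log(1 - p)

p = 0.3
assert math.isclose(cross_entropy(p, p), bernoulli_entropy(p))
# With a mismatched p', the (un-negated) cross entropy is strictly smaller:
print(cross_entropy(0.3, 0.5))   # ≈ -0.693, versus bernoulli_entropy(0.3) ≈ -0.611
```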

To put these formulas together, we just have to specify the exact comparison we want to make. Suppose we have data from two different coin-flipping sessions, and we want to know whether the same coin was used in both sessions, or whether there were two different coins with different weights.

We propose a null hypothesis: the coin used was the same. To test that hypothesis, we need to perform a total of eight weighted log calculations. Four of them will be simple entropy calculations, while the other four will be cross-entropy calculations; the difference will be in the probabilistic model we use. We will use the data for each session to calculate two different probabilities: $p_{1h}$ and $p_{2h}$. Then we will use all the data to calculate just one combined probability $p_{ch}$. The cross-entropy calculations will compare session probabilities to combined probabilities; the entropy calculations will compare session probabilities to themselves. Under the null hypothesis, the values will be the same, and will cancel each other out.

Recall that the logarithm of a probability is always negative; for clarity later, the formulas below are negated so that they give positive values.

  • Entropy calculations:
    • First session
      • Heads: $-p_{1h} \log(p_{1h})$
      • Tails: $-(1 - p_{1h}) \log(1 - p_{1h})$
    • Second session
      • Heads: $-p_{2h} \log(p_{2h})$
      • Tails: $-(1 - p_{2h}) \log(1 - p_{2h})$
  • Cross-entropy calculations:
    • First session
      • Heads: $-p_{1h} \log(p_{ch})$
      • Tails: $-(1 - p_{1h}) \log(1 - p_{ch})$
    • Second session
      • Heads: $-p_{2h} \log(p_{ch})$
      • Tails: $-(1 - p_{2h}) \log(1 - p_{ch})$

Cross entropy is always greater than or equal to entropy, so we want to subtract entropy from cross entropy to get a meaningful value. If the two are nearly equal, the result will be close to zero, and we accept the null hypothesis. If not, the result is guaranteed to be positive, and the larger it is, the more inclined we are to reject the null hypothesis.
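
Here is what those eight terms look like in a small Python sketch, using hypothetical counts of 30/100 heads in the first session and 45/100 in the second (variable names follow the notation above):

```python
import math

k1, n1 = 30, 100
k2, n2 = 45, 100

p1h = k1 / n1
p2h = k2 / n2
pch = (k1 + k2) / (n1 + n2)   # combined ("null hypothesis") probability of heads

# Four entropy terms (negated, so each is positive)
entropy = (
    -p1h * math.log(p1h) - (1 - p1h) * math.log(1 - p1h)
    - p2h * math.log(p2h) - (1 - p2h) * math.log(1 - p2h)
)

# Four cross-entropy terms, comparing each session to the combined probability
cross_entropy = (
    -p1h * math.log(pch) - (1 - p1h) * math.log(1 - pch)
    - p2h * math.log(pch) - (1 - p2h) * math.log(1 - pch)
)

print(cross_entropy - entropy)   # ≈ 0.0241; always >= 0, near zero only if the sessions agree
```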

Recall that a difference of logarithms is the logarithm of a ratio, so subtracting the first entropy value from the first cross-entropy value gives

$$ \begin{align} & -p_{1h} \log(p_{ch}) - \left(-p_{1h} \log(p_{1h})\right) \\ =\ & p_{1h} \log(p_{1h}) - p_{1h} \log(p_{ch}) \\ =\ & p_{1h} \log\frac{p_{1h}}{p_{ch}} \\ \end{align} $$

Combining all the terms and logarithms together, we get this definition of the combined, normalized log likelihood ratio, $ \mathrm{logL'} $:

$$ \begin{align} \mathrm{logL'} &=\ p_{1h}\ \log\frac{p_{1h}}{p_{ch}} \\ &+ (1 - p_{1h})\ \log\frac{(1 - p_{1h})}{(1 - p_{ch})} \\ &+ p_{2h}\ \log\frac{p_{2h}}{p_{ch}} \\ &+ (1 - p_{2h})\ \log\frac{(1 - p_{2h})}{(1 - p_{ch})} \\ \end{align} $$
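
As a sanity check, the sketch below (same hypothetical counts as before) computes $\mathrm{logL'}$ directly from the counts and confirms that weighting each session's terms by its trial count recovers the un-normalized difference of the four $\mathrm{logL}$ values:

```python
import math

def logL(p, n, k):
    return k * math.log(p) + (n - k) * math.log(1 - p)

def kl_bernoulli(p, q):
    """p*log(p/q) + (1-p)*log((1-p)/(1-q)) -- one session's terms in logL'."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

k1, n1, k2, n2 = 30, 100, 45, 100
p1, p2, pc = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)

logL_prime = kl_bernoulli(p1, pc) + kl_bernoulli(p2, pc)

# Weighting each session's terms by its trial count gives the un-normalized
# difference of the logL values from the top of this answer.
unnormalized = (logL(p1, n1, k1) + logL(p2, n2, k2)
                - logL(pc, n1, k1) - logL(pc, n2, k2))

print(logL_prime)                                             # ≈ 0.0241
print(n1 * kl_bernoulli(p1, pc) + n2 * kl_bernoulli(p2, pc))  # ≈ 2.41
print(unnormalized)                                           # ≈ 2.41, the same value
```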

At this point, converting this to PMI is just a matter of reinterpreting the notation. $p_{1h}$ is the probability of turning up heads in the first session, and so we could also call it the conditional probability of heads given that we're looking only at data from the first session. We can do similar things to the other probabilities from each session:

$$ \begin{align} p_{1h} &= \mathrm{P}(h|c_1) \\ (1 - p_{1h}) &= \mathrm{P}(t|c_1) \\ p_{2h} &= \mathrm{P}(h|c_2) \\ (1 - p_{2h}) &= \mathrm{P}(t|c_2) \end{align} $$

The null hypothesis probabilities are not conditional but "prior" (marginal) probabilities, calculated without taking into account which session the data came from:

$$ \begin{align} p_{ch} &= \mathrm{P}(h) \\ 1-p_{ch} &= \mathrm{P}(t) \\ \end{align} $$

Now, by the definition of conditional probability we have

$$ \begin{align} \frac{\mathrm{P}(h,c)}{\mathrm{P}(c)} &= \mathrm{P}(h|c) \\ \frac{\mathrm{P}(h,c)}{\mathrm{P}(c)\mathrm{P}(h)} &= \frac{\mathrm{P}(h|c)}{\mathrm{P}(h)} \end{align} $$

And we see that by converting the $\mathrm{logL'}$ formulas to use conditional probability notation, we have (for example)

$$ \begin{align} & \mathrm{P}(h|c_1)\ \log\frac{\mathrm{P}(h|c_1)}{\mathrm{P}(h)} \\ \end{align} $$

This is exactly the heads term of $\mathrm{D_{KL}}(\mathrm{P}(\cdot|c_1)\ ||\ \mathrm{P}(\cdot))$, the KL divergence of the conditional distribution of outcomes in session one from the prior (marginal) distribution. So the log likelihood, when normalized by the number of trials in each session, is the same as the sum, over sessions, of the KL divergences of each session's conditional distribution from the prior distribution. If you understand KL divergence, this provides a good intuition for how the test works: it measures the "distance" between the conditional and unconditional probabilities for each outcome. If that distance is large, then the null hypothesis is probably false.
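
If you have SciPy available, `scipy.stats.entropy(p, q)` computes exactly this KL divergence, so the whole normalized statistic reduces to two calls (same hypothetical counts as before):

```python
from scipy.stats import entropy   # entropy(p, q) computes D_KL(p || q) in nats

k1, n1, k2, n2 = 30, 100, 45, 100
p1 = [k1 / n1, 1 - k1 / n1]                              # [P(h|c1), P(t|c1)]
p2 = [k2 / n2, 1 - k2 / n2]                              # [P(h|c2), P(t|c2)]
pc = [(k1 + k2) / (n1 + n2), 1 - (k1 + k2) / (n1 + n2)]  # [P(h), P(t)]

# logL' is just the sum of the two divergences from the combined distribution.
print(entropy(p1, pc) + entropy(p2, pc))   # ≈ 0.0241 with these counts
```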

The relationship between mutual information and KL divergence is well known. So we're nearly done. Starting from the above formula, we have

$$ \begin{align} & \mathrm{P}(h|c_1)\ \log\frac{\mathrm{P}(h|c_1)}{\mathrm{P}(h)} \\ =\ & \mathrm{P}(h|c_1)\ \log\frac{\mathrm{P}(h,c_1)}{\mathrm{P}(c_1)\mathrm{P}(h)} \\ =\ & \mathrm{P}(h|c_1) \cdot \mathrm{PMI}(h; c_1) \end{align} $$

where the last step is just the definition of Pointwise Mutual Information. Putting it all together:

$$ \begin{align} \mathrm{logL'} &= \mathrm{P}(h|c_1) \cdot \mathrm{PMI}(h; c_1) \\ &+ \mathrm{P}(t|c_1) \cdot \mathrm{PMI}(t; c_1) \\ &+ \mathrm{P}(h|c_2) \cdot \mathrm{PMI}(h; c_2) \\ &+ \mathrm{P}(t|c_2) \cdot \mathrm{PMI}(t; c_2) \\ \end{align} $$

We could recover the pre-normalized version by using total counts from the first and second sessions: multiplying $\mathrm{P}(h|c_n)$ by the number of trials in session $n$ gives the number of heads in session $n$, which recovers the original definition of $\mathrm{logL}$ at the top of this answer.

Dividing that number by the total number of trials would give $\mathrm{P}(h,c_n)$, converting this formula into the formula for mutual information, the weighted sum of PMI values for each outcome. So the difference between "log likelihood" and mutual information (pointwise or otherwise) is just a matter of normalization scheme.
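
To make that concrete, here is one last sketch (hypothetical counts again) that computes PMI for each (outcome, session) pair and weights it three ways: by joint probabilities (mutual information), by conditional probabilities (the $\mathrm{logL'}$ above), and by raw counts (the un-normalized statistic):

```python
import math

# Hypothetical counts, as in the sketches above: 30/100 and 45/100 heads.
k1, n1, k2, n2 = 30, 100, 45, 100
N = n1 + n2

counts = {('h', 'c1'): k1, ('t', 'c1'): n1 - k1,
          ('h', 'c2'): k2, ('t', 'c2'): n2 - k2}
p_outcome = {'h': (k1 + k2) / N, 't': (N - k1 - k2) / N}   # P(h), P(t)
p_session = {'c1': n1 / N, 'c2': n2 / N}                   # P(c1), P(c2)
n_session = {'c1': n1, 'c2': n2}

def pmi(o, c):
    """log of P(o, c) / (P(o) * P(c))"""
    return math.log((counts[(o, c)] / N) / (p_outcome[o] * p_session[c]))

# PMI weighted by joint probabilities P(o, c): mutual information.
mi = sum((counts[(o, c)] / N) * pmi(o, c) for o, c in counts)
# PMI weighted by conditional probabilities P(o | c): the logL' formula above.
logL_prime = sum((counts[(o, c)] / n_session[c]) * pmi(o, c) for o, c in counts)
# PMI weighted by raw counts: the un-normalized "log likelihood" statistic.
unnormalized = sum(counts[(o, c)] * pmi(o, c) for o, c in counts)

print(mi)            # ≈ 0.0121
print(logL_prime)    # ≈ 0.0241
print(unnormalized)  # ≈ 2.41  (= N * mi)
```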