They're very closely related. "Log-likelihood" is not a very useful name, because it sounds like it's going to be a log probability, but it's not that at all. "Log likelihood ratio", as used at times in the original paper, is more accurate, and that name immediately hints that we should expect this metric to compare (as a ratio) two different probabilities. Furthermore, the formula is structured as a weighted sum of logarithms, just like the basic formula for entropy ($-\sum_i p_i \log p_i$). All of this strongly hints that this metric is closely related to an information-theoretic quantity like mutual information.
Here's the formula for log likelihood given $p_n$, $k_n$, and $n_n$ as it appears in the paper:
$$ \mathrm{logL}(p_n, n_n, k_n) = k_n \log(p_n) + (n_n - k_n) \log (1 - p_n) $$
As the paper itself notes, $p$ is just $k / n$, so we need only normalize by $n$ to get the entropy formula for a Bernoulli distribution (flips of a weighted coin with probability $p$ of turning up heads):
$$ \begin{align} \mathrm{BernoulliEntropy}(p, n, k) & = \frac{k}{n} \log(p) + \frac{(n - k)}{n} \log (1 - p) \\ \mathrm{BernoulliEntropy}(p) & = p \log(p) + (1 - p)\log(1 - p) \end{align} $$
This normalization is the only real difference, and I find it a little puzzling that the author didn't adopt this simplified approach.
The other equation we need (using this new $n$-normalized formulation) is the formula for cross-entropy. It is almost identical, but it compares two different probabilities: it gives us a way to measure the "cost" of representing one probability distribution with another. (Concretely, suppose you compress data from one distribution with an encoding optimized for another distribution; this tells you how many bits it will take on average.)
$$ \mathrm{CrossEntropy}(p,p') = p \log(p') + (1 - p) \log (1 - p') $$
Note that if $p$ and $p'$ are the same, then this winds up being identical to Bernoulli entropy.
$$ \mathrm{BernoulliEntropy}(p) = \mathrm{CrossEntropy}(p, p) $$
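As a quick numerical sanity check, here is a minimal Python sketch (the function names and the counts 30 heads in 100 flips are mine) of both identities: dividing $\mathrm{logL}$ by $n$ recovers the Bernoulli entropy formula, and cross-entropy of a distribution with itself is just its entropy.

```python
import math

def logL(p, n, k):
    # Unnormalized form from the paper: k log(p) + (n - k) log(1 - p)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def bernoulli_entropy(p):
    # n-normalized form: p log(p) + (1 - p) log(1 - p)
    return p * math.log(p) + (1 - p) * math.log(1 - p)

def cross_entropy(p, q):
    # Same shape, but the probabilities inside the logs come from a second model q.
    return p * math.log(q) + (1 - p) * math.log(1 - q)

n, k = 100, 30
p = k / n
# Dividing logL by n gives the Bernoulli entropy formula...
assert math.isclose(logL(p, n, k) / n, bernoulli_entropy(p))
# ...and cross-entropy of a distribution with itself is its entropy.
assert math.isclose(cross_entropy(p, p), bernoulli_entropy(p))
```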
To put these formulas together, we just have to specify the exact comparison we want to make. Suppose we have data from two different coin-flipping sessions, and we want to know whether the same coin was used in both sessions, or whether there were two different coins with different weights.
We propose a null hypothesis: the coin used was the same. To test that hypothesis, we need to perform a total of eight weighted log calculations. Four of them will be simple entropy calculations, while the other four will be cross-entropy calculations; the difference will be in the probabilistic model we use. We will use the data for each session to calculate two different probabilities: $p_{1h}$ and $p_{2h}$. Then we will use all the data to calculate just one combined probability $p_{ch}$. The cross-entropy calculations will compare session probabilities to combined probabilities; the entropy calculations will compare session probabilities to themselves. Under the null hypothesis, the values will be the same, and will cancel each other out.
Recall that the logarithm of a probability is always negative; for clarity later, the formulas below are negated so that they give positive values.
- Entropy calculations:
  - First session
    - Heads: $-p_{1h} \log(p_{1h})$
    - Tails: $-(1 - p_{1h}) \log(1 - p_{1h})$
  - Second session
    - Heads: $-p_{2h} \log(p_{2h})$
    - Tails: $-(1 - p_{2h}) \log(1 - p_{2h})$
- Cross-entropy calculations:
  - First session
    - Heads: $-p_{1h} \log(p_{ch})$
    - Tails: $-(1 - p_{1h}) \log(1 - p_{ch})$
  - Second session
    - Heads: $-p_{2h} \log(p_{ch})$
    - Tails: $-(1 - p_{2h}) \log(1 - p_{ch})$
Cross-entropy is always greater than or equal to entropy, so we subtract entropy from cross-entropy to get a meaningful value. If the two are close to equal, the result will be close to zero, and we have no reason to reject the null hypothesis. Otherwise the value is guaranteed to be positive, and the larger it is, the more inclined we are to reject the null hypothesis.
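Here is a small sketch of those eight weighted log calculations in Python (the counts, 30 heads in 100 flips for session one and 45 in 100 for session two, and all names are mine):

```python
import math

def terms(p_session, p_model):
    # Heads and tails contributions, negated so they come out positive.
    return (-p_session * math.log(p_model),
            -(1 - p_session) * math.log(1 - p_model))

# Hypothetical data: session 1 = 30 heads / 100 flips, session 2 = 45 / 100.
k1, n1, k2, n2 = 30, 100, 45, 100
p1, p2 = k1 / n1, k2 / n2
pc = (k1 + k2) / (n1 + n2)   # combined (null hypothesis) probability

entropy_terms = terms(p1, p1) + terms(p2, p2)   # four entropy terms
cross_terms   = terms(p1, pc) + terms(p2, pc)   # four cross-entropy terms

# Cross-entropy is never smaller than entropy, so the difference is >= 0,
# and it is zero exactly when both sessions have the same heads probability.
diff = sum(cross_terms) - sum(entropy_terms)
assert diff >= 0
```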
Recall that a difference of logarithms is the logarithm of a ratio, so subtracting the first entropy value from the first cross-entropy value gives
$$ \begin{align}
& -p_{1h} \log(p_{ch}) - -p_{1h} \log(p_{1h}) \\
=\ & p_{1h} \log(p_{1h}) - p_{1h} \log(p_{ch}) \\
=\ & p_{1h} \log\frac{p_{1h}}{p_{ch}} \\
\end{align} $$
Combining all the terms and logarithms together, we get this definition of the combined, normalized log likelihood ratio, $ \mathrm{logL'} $:
$$ \begin{align}
\mathrm{logL'}
&=\ p_{1h}\ \log\frac{p_{1h}}{p_{ch}} \\
&+ (1 - p_{1h})\ \log\frac{(1 - p_{1h})}{(1 - p_{ch})} \\
&+ p_{2h}\ \log\frac{p_{2h}}{p_{ch}} \\
&+ (1 - p_{2h})\ \log\frac{(1 - p_{2h})}{(1 - p_{ch})} \\
\end{align} $$
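As a check on the algebra, this sketch (hypothetical counts of 30/100 and 45/100 heads; names are mine) computes $\mathrm{logL'}$ from the four log-ratio terms and confirms it matches the cross-entropy-minus-entropy difference:

```python
import math

k1, n1, k2, n2 = 30, 100, 45, 100   # hypothetical session counts
p1, p2 = k1 / n1, k2 / n2
pc = (k1 + k2) / (n1 + n2)          # combined (null hypothesis) probability

# logL' as the sum of four weighted log-ratios.
logL_prime = (p1 * math.log(p1 / pc)
              + (1 - p1) * math.log((1 - p1) / (1 - pc))
              + p2 * math.log(p2 / pc)
              + (1 - p2) * math.log((1 - p2) / (1 - pc)))

def h(p, q):
    # Negated cross-entropy of p against model q; h(p, p) is the entropy.
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

# Same quantity computed as (cross-entropy) - (entropy), session by session.
assert math.isclose(logL_prime,
                    h(p1, pc) + h(p2, pc) - h(p1, p1) - h(p2, p2))
```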
At this point, converting this to PMI is just a matter of reinterpreting the notation. $p_{1h}$ is the probability of turning up heads in the first session, and so we could also call it the conditional probability of heads given that we're looking only at data from the first session. We can do similar things to the other probabilities from each session:
$$
\begin{align}
p_{1h} &= \mathrm{P}(h|c_1) \\
(1 - p_{1h}) &= \mathrm{P}(t|c_1) \\
p_{2h} &= \mathrm{P}(h|c_2) \\
(1 - p_{2h}) &= \mathrm{P}(t|c_2)
\end{align}
$$
The null hypothesis probabilities are not conditional but "prior" probabilities -- probabilities calculated without taking into account additional information about the sessions:
$$
\begin{align}
p_{ch} &= \mathrm{P}(h) \\
1-p_{ch} &= \mathrm{P}(t) \\
\end{align}
$$
Now, by the definition of conditional probability we have
$$ \begin{align}
\frac{\mathrm{P}(h,c)}{\mathrm{P}(c)} &= \mathrm{P}(h|c) \\
\frac{\mathrm{P}(h,c)}{\mathrm{P}(c)\mathrm{P}(h)} &= \frac{\mathrm{P}(h|c)}{\mathrm{P}(h)}
\end{align} $$
And we see that by converting the $\mathrm{logL'}$ formulas to use conditional probability notation, we have (for example)
$$ \begin{align}
& \mathrm{P}(h|c_1)\ \log\frac{\mathrm{P}(h|c_1)}{\mathrm{P}(h)} \\
\end{align} $$
This is exactly the heads term of $\mathrm{D_{KL}}\big(\mathrm{P}(\cdot|c_1)\ \|\ \mathrm{P}(\cdot)\big)$, the KL divergence of the conditional distribution of outcomes in session one from the prior distribution. So the log likelihood, when normalized by the number of trials in each session, is the same as the sum, over sessions, of the KL divergences of the conditional distributions from the prior distribution. If you understand KL divergence, this provides a good intuition for how the test works: it measures the "distance" between the conditional and unconditional probabilities for each outcome. If that distance is large, then the null hypothesis is probably false.
The relationship between mutual information and KL divergence is well known. So we're nearly done. Starting from the above formula, we have
$$ \begin{align}
& \mathrm{P}(h|c_1)\ \log\frac{\mathrm{P}(h|c_1)}{\mathrm{P}(h)} \\
=\ & \mathrm{P}(h|c_1)\ \log\frac{\mathrm{P}(h,c_1)}{\mathrm{P}(c_1)\mathrm{P}(h)} \\
=\ & \mathrm{P}(h|c_1) \cdot \mathrm{PMI}(h; c_1)
\end{align} $$
Where the last version is based on the definition of Pointwise Mutual Information (as given here). Putting it all together:
$$ \begin{align}
\mathrm{logL'}
&= \mathrm{P}(h|c_1) \cdot \mathrm{PMI}(h; c_1) \\
&+ \mathrm{P}(t|c_1) \cdot \mathrm{PMI}(t; c_1) \\
&+ \mathrm{P}(h|c_2) \cdot \mathrm{PMI}(h; c_2) \\
&+ \mathrm{P}(t|c_2) \cdot \mathrm{PMI}(t; c_2) \\
\end{align} $$
We could recover the pre-normalized version by using total counts from the first and second sessions: multiplying $\mathrm{P}(h|c_n)$ by the number of trials in session $n$ gives the number of heads in session $n$, which recovers the original definition of $\mathrm{logL}$ at the top of this answer.
Dividing that number by the total number of trials would give $\mathrm{P}(h,c_n)$, converting this formula into the formula for mutual information, the weighted sum of PMI values for each outcome. So the difference between "log likelihood" and mutual information (pointwise or otherwise) is just a matter of normalization scheme.
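To make the normalization relationship concrete, here is a sketch (hypothetical counts of 30/100 and 45/100 heads; names are mine) confirming that weighting each PMI term by the joint probability $\mathrm{P}(h, c_n)$, rather than the conditional $\mathrm{P}(h|c_n)$, turns the $\mathrm{logL'}$ sum into mutual information:

```python
import math

k1, n1, k2, n2 = 30, 100, 45, 100   # hypothetical session counts
N = n1 + n2
# Joint probabilities P(outcome, session) estimated from the counts.
joint = {('h', 1): k1 / N, ('t', 1): (n1 - k1) / N,
         ('h', 2): k2 / N, ('t', 2): (n2 - k2) / N}
p_session = {1: n1 / N, 2: n2 / N}
p_outcome = {'h': (k1 + k2) / N, 't': (N - k1 - k2) / N}

def pmi(o, s):
    return math.log(joint[(o, s)] / (p_session[s] * p_outcome[o]))

# logL' terms: per-session KL divergences, PMI weighted by conditionals P(o|c_s).
kl = {s: sum(joint[(o, s)] / p_session[s] * pmi(o, s) for o in 'ht')
      for s in (1, 2)}
logL_prime = kl[1] + kl[2]

# Mutual information weights the same PMI values by the joint probabilities,
# which amounts to weighting each session's KL divergence by P(c_s).
mi = sum(joint[key] * pmi(*key) for key in joint)
assert math.isclose(mi, p_session[1] * kl[1] + p_session[2] * kl[2])
```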
I'm not a stats specialist, but I will give it a shot.
First, we can approximate the probability of each event by its empirical probability, i.e. the number of occurrences divided by the total number of trials:
$$ p(\mathrm{motif}_i, \mathrm{condition}_j) = \frac{\text{number of occurrences of motif } i \text{ with condition } j}{\sum_{i,j} \text{number of occurrences of motif } i \text{ with condition } j} $$
I'll use the shorthands $m_1$, $m_2$, $c_1$, $c_2$ for the motifs and conditions in your table. The approximation gives the following joint distribution $p(m_i, c_j)$:
|       | $c_1$ | $c_2$ |
|-------|-------|-------|
| $m_1$ | 0.1   | 0.05  |
| $m_2$ | 0.4   | 0.45  |
Marginal probabilities can be computed by just summing rows and columns; have a look at the example at https://en.wikipedia.org/wiki/Marginal_distribution. Here, $p(m_1) = 0.15$ and $p(c_1) = 0.5$.
Then, the mutual information can be computed from its definition:
$$ I(\mathrm{motif}; \mathrm{condition}) = \sum_{i \in \{1,2\},\ j \in \{1,2\}} p(m_i, c_j) \log\frac{p(m_i, c_j)}{p(m_i)\, p(c_j)} $$
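Carrying out that computation for the table above (a minimal Python sketch; the variable names are mine, and the log is natural, so the result is in nats):

```python
import math

# Joint distribution from the table above.
p = {('m1', 'c1'): 0.1, ('m1', 'c2'): 0.05,
     ('m2', 'c1'): 0.4, ('m2', 'c2'): 0.45}

# Marginals by summing rows and columns.
p_m = {m: sum(v for (mi_, c), v in p.items() if mi_ == m) for m in ('m1', 'm2')}
p_c = {c: sum(v for (m, cj), v in p.items() if cj == c) for c in ('c1', 'c2')}

# Mutual information from its definition.
I = sum(v * math.log(v / (p_m[m] * p_c[c])) for (m, c), v in p.items())
# I comes out close to 0.00997 nats: the motif and condition are only weakly dependent.
```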
Best Answer
Even though they look similar, they are quite different things. Let's start with the major differences.
$h$ is something different in PMI and in WOE
Notice the term $p(h)$ in PMI. This implies that $h$ is a random variable of which you can compute the probability. For a Bayesian, that's no problem, but if you do not believe that hypotheses can have a probability a priori you cannot even write PMI for hypothesis and evidence. In WOE, $h$ is a parameter of the distribution and the expressions are always defined.
PMI is symmetric, WOE is not
Trivially, $\mathrm{pmi}(e,h) = \mathrm{pmi}(h,e)$. However, $w(h:e) = \log \frac{p(h|e)}{p(h|\bar{e})}$ need not be defined because of the term $\bar{e}$. Even when it is, it is in general not equal to $w(e:h)$.
Other than that, WOE and PMI have similarities.
The weight of evidence says how much the evidence speaks in favor of a hypothesis. If it is 0, it means that it neither speaks for nor against. The higher it is, the more it validates hypothesis $h$, and the lower it is, the more it validates $\bar{h}$.
Mutual information quantifies how the occurrence of an event ($e$ or $h$) says something about the occurrence of the other event. If it is 0, the events are independent and the occurrence of one says nothing about the other. The higher it is the more often they co-occur, and the lower it is the more they are mutually exclusive.
What about the cases where the hypothesis $h$ is also a random variable and both options are valid? For example, in communication over a binary noisy channel, the hypothesis $h$ is the emitted signal to be decoded and the evidence is the received signal. Say that the probability of a flip is $1/1000$; then if you receive a $1$, the WOE for $1$ is $\log(0.999/0.001) \approx 6.90$. The PMI, on the other hand, depends on the probability of emitting a $1$. You can verify that as the probability of emitting a $1$ tends to $0$, the PMI tends to $6.90$, while it tends to $0$ as the probability of emitting a $1$ tends to $1$.
This paradoxical behavior illustrates two things:
- Neither of them is suitable for making a guess about the emission. If the probability of emitting a $1$ drops below $1/1000$, the most likely emission is $0$ even when a $1$ is received; yet for small probabilities of emitting a $1$, both WOE and PMI remain close to $6.90$.
- PMI is a gain of (Shannon) information over the realization of the hypothesis: if the hypothesis is almost sure, then no information is gained. WOE is an update of our prior odds, which does not depend on the value of those odds.
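The noisy-channel numbers can be checked directly. In this sketch (names are mine, logs are natural), `woe` is the weight of evidence for $h=1$ after receiving a $1$, and `pmi(q)` is $\mathrm{pmi}(h{=}1, e{=}1)$ as a function of the prior probability $q$ of emitting a $1$:

```python
import math

flip = 1 / 1000   # channel flip probability

# Weight of evidence for h=1 given e=1: log P(e=1|h=1) / P(e=1|h=0).
woe = math.log((1 - flip) / flip)       # about 6.90, independent of the prior

def pmi(q):
    # PMI(h=1; e=1) = log P(e=1|h=1) / P(e=1), with prior q = P(h=1).
    p_e1 = q * (1 - flip) + (1 - q) * flip   # total probability of receiving a 1
    return math.log((1 - flip) / p_e1)

assert abs(pmi(1e-9) - woe) < 1e-3   # prior -> 0: PMI approaches the WOE
assert abs(pmi(1 - 1e-9)) < 1e-3     # prior -> 1: PMI approaches 0
```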