Correlation measures the linear relationship (Pearson's correlation) or monotonic relationship (Spearman's correlation) between two variables, X and Y.
Mutual information is more general: it measures the reduction of uncertainty in Y after observing X. It is the Kullback–Leibler divergence between the joint density and the product of the marginal densities, so MI can capture non-monotonic and other more complicated relationships.
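A quick numerical sketch of this difference (assuming a plug-in, histogram-based MI estimate, which is only illustrative): for $Y = X^2$ with $X$ uniform on $[-1,1]$, the Pearson correlation is essentially zero while the mutual information is clearly positive.

```python
import numpy as np

# Y = X^2 has no linear relationship with X, but knowing X determines Y,
# so the mutual information between them is large.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100_000)
y = x ** 2

pearson = np.corrcoef(x, y)[0, 1]  # close to 0: no linear relationship

# Plug-in estimate of MI (in nats) from a binned joint histogram.
counts, _, _ = np.histogram2d(x, y, bins=20)
pxy = counts / counts.sum()
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
nz = pxy > 0  # avoid log(0) on empty cells
mi = float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

print(pearson, mi)  # pearson near 0, mi clearly positive
```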
Even though they look similar, they are quite different things. Let's start with the major differences.
$h$ is something different in PMI and in WOE
Notice the term $p(h)$ in PMI, $pmi(h,e) = \log \frac{p(h,e)}{p(h)\,p(e)}$. This implies that $h$ is a random variable whose probability you can compute. For a Bayesian, that's no problem, but if you do not believe that hypotheses can have an a priori probability, you cannot even write down the PMI between a hypothesis and the evidence. In WOE, $w(h:e) = \log \frac{p(e|h)}{p(e|\bar{h})}$, $h$ is a parameter of the distribution and the expressions are always defined.
PMI is symmetric, WOE is not
Trivially, $pmi(e,h) = pmi(h,e)$. However, $w(e:h) = \log p(h|e)/p(h|\bar{e})$ need not even be defined, because of the term $\bar{e}$, and even when it is defined, it is in general not equal to $w(h:e)$.
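These two properties can be checked on a made-up joint distribution over a binary $h$ and a binary $e$ (the numbers below are arbitrary, chosen only for illustration):

```python
import math

# Toy joint distribution p(h, e) over binary hypothesis and evidence.
p = {(1, 1): 0.30, (1, 0): 0.10, (0, 1): 0.15, (0, 0): 0.45}
ph = {v: p[(v, 0)] + p[(v, 1)] for v in (0, 1)}  # marginal p(h)
pe = {v: p[(0, v)] + p[(1, v)] for v in (0, 1)}  # marginal p(e)

def p_h_given_e(h, e):  # p(h | e)
    return p[(h, e)] / pe[e]

def p_e_given_h(e, h):  # p(e | h)
    return p[(h, e)] / ph[h]

# PMI is symmetric: log p(h|e)/p(h) equals log p(e|h)/p(e).
pmi_he = math.log(p_h_given_e(1, 1) / ph[1])
pmi_eh = math.log(p_e_given_h(1, 1) / pe[1])

# WOE is not: w(h:e) = log p(e|h)/p(e|~h) vs w(e:h) = log p(h|e)/p(h|~e).
w_he = math.log(p_e_given_h(1, 1) / p_e_given_h(1, 0))
w_eh = math.log(p_h_given_e(1, 1) / p_h_given_e(1, 0))
```

Running this, `pmi_he` and `pmi_eh` agree to floating-point precision while `w_he` and `w_eh` differ.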
Other than that, WOE and PMI have similarities.
The weight of evidence says how much the evidence speaks in favor of a hypothesis. If it is 0, it means that it neither speaks for nor against. The higher it is, the more it validates hypothesis $h$, and the lower it is, the more it validates $\bar{h}$.
Pointwise mutual information quantifies how much the occurrence of one event ($e$ or $h$) tells us about the occurrence of the other. If it is 0, the events are independent and the occurrence of one says nothing about the other. The higher it is, the more often they co-occur; the more negative it is, the more they tend to exclude each other.
What about the cases where the hypothesis $h$ is also a random variable and both options are valid? For example, in communication over a binary noisy channel, the hypothesis $h$ is the emitted signal to decode and the evidence is the received signal. Say that the probability of flipping is $1/1000$, so if you receive a $1$, the WOE for $1$ is $\log(0.999/0.001) \approx 6.90$. The PMI, on the other hand, depends on the probability of emitting a $1$. You can verify that when the probability of emitting a $1$ tends to $0$, the PMI tends to $6.90$, while it tends to $0$ when the probability of emitting a $1$ tends to $1$.
This paradoxical behavior illustrates two things:
Neither of them is suitable for making a guess about the emission. If the probability of emitting a $1$ drops below $1/1000$, the most likely emission is $0$ even when a $1$ is received. Yet for small probabilities of emitting a $1$, both WOE and PMI stay close to $6.90$.
PMI is a gain of (Shannon) information about the realization of the hypothesis: if the hypothesis is almost sure, no information is gained. WOE is an update of our prior odds, which does not depend on the value of those odds.
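The channel example can be verified numerically (the helper names below are mine; $q$ denotes the prior probability of emitting a $1$):

```python
import math

# Binary channel with flip probability 1/1000; we receive e = 1 and
# consider the hypothesis h = "a 1 was emitted", for a range of priors q.
flip = 1e-3

def woe(q):
    # w(h:e) = log p(e=1|h=1)/p(e=1|h=0): independent of the prior q.
    return math.log((1 - flip) / flip)

def pmi(q):
    # pmi(h,e) = log p(e=1|h=1)/p(e=1), where p(e=1) mixes over the prior.
    pe1 = q * (1 - flip) + (1 - q) * flip
    return math.log((1 - flip) / pe1)

def posterior_1(q):
    # p(h=1 | e=1) by Bayes' rule.
    pe1 = q * (1 - flip) + (1 - q) * flip
    return q * (1 - flip) / pe1
```

As $q \to 0$ the PMI approaches $\log 999 \approx 6.90$ (the WOE), and as $q \to 1$ it approaches $0$; meanwhile, for $q$ just below $1/1000$ the posterior still favors $0$, illustrating the paradox.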
According to Dan Jurafsky and James H. Martin's book *Speech and Language Processing*:
"It turns out, however, that simple frequency isn’t the best measure of association between words. One problem is that raw frequency is very skewed and not very discriminative. If we want to know what kinds of contexts are shared by apricot and pineapple but not by digital and information, we’re not going to get good discrimination from words like the, it, or they, which occur frequently with all sorts of words and aren’t informative about any particular word."
Sometimes we replace this raw frequency with positive pointwise mutual information (PPMI):
$$\text{PPMI}(w,c) = \max{\left(\log_{2}{\frac{P(w,c)}{P(w)P(c)}},0\right)}$$
On its own, PMI measures how much more likely we are to observe a word $w$ together with a context word $c$ than we would be if they occurred independently. In PPMI we keep only the positive values of PMI. Let's think about when PMI is positive or negative, and why we discard the negative values:
What does positive PMI mean?
$\frac{P(w,c)}{(P(w)P(c))} > 1$
$P(w,c) > (P(w)P(c))$
It happens when $w$ and $c$ co-occur more often than they would under independence, like kick and ball. We'd like to keep these!
What does negative PMI mean?
$\frac{P(w,c)}{(P(w)P(c))} < 1$
$P(w,c) < (P(w)P(c))$
It means that $w$ and $c$ co-occur less often than chance would predict. This may indicate unreliable statistics due to limited data; otherwise it points to an uninformative co-occurrence, e.g., 'the' and 'ball' ('the' occurs with most words anyway).
PMI, and PPMI in particular, helps us keep the informative co-occurrences while discarding the uninformative ones.
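As a sketch, here is PPMI computed on a tiny made-up count table in the spirit of the kick/ball and the/ball examples (the counts are invented for illustration):

```python
import numpy as np

# Hypothetical word-context counts; rows = words, columns = context words.
words = ["kick", "the"]
contexts = ["ball", "of"]
counts = np.array([[30.0, 1.0],    # "kick" co-occurs mostly with "ball"
                   [20.0, 49.0]])  # "the" occurs with everything

total = counts.sum()
p_wc = counts / total                       # joint P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)       # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)       # marginal P(c)

pmi = np.log2(p_wc / (p_w * p_c))           # log2, as in the PPMI formula
ppmi = np.maximum(pmi, 0.0)                 # clip negative PMI values to 0
```

In this table the informative pair ("kick", "ball") keeps a positive score, while the uninformative pairs such as ("the", "ball") are clipped to 0.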