Even though they look similar, they are quite different things. Let's start with the major differences.
$h$ is something different in PMI and in WOE
Notice the term $p(h)$ in PMI. This implies that $h$ is a random variable whose probability you can compute. For a Bayesian that is no problem, but if you do not believe that hypotheses can have a prior probability, you cannot even write the PMI of hypothesis and evidence. In WOE, $h$ is a parameter of the distribution and the expressions are always defined.
PMI is symmetric, WOE is not
Trivially, $pmi(e,h) = pmi(h,e)$. However, $w(h:e) = \log p(h|e)/p(h|\bar{e})$ need not be defined because of the term $\bar{e}$, and even when it is defined, it is in general not equal to $w(e:h)$.
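To make the asymmetry concrete, here is a small numeric sketch. The joint probabilities below are made up purely for illustration:

```python
import math

# Made-up joint distribution over evidence e and hypothesis h, both binary.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}  # keys: (e, h)

p_e = {e: joint[(e, 0)] + joint[(e, 1)] for e in (0, 1)}
p_h = {h: joint[(0, h)] + joint[(1, h)] for h in (0, 1)}

def pmi(e, h):
    # pmi(e, h) = log p(e, h) / (p(e) p(h)); symmetric, since p(e, h) = p(h, e)
    return math.log(joint[(e, h)] / (p_e[e] * p_h[h]))

# WOE as written above: w(h:e) = log p(h=1|e=1) / p(h=1|e=0)
w_he = math.log((joint[(1, 1)] / p_e[1]) / (joint[(0, 1)] / p_e[0]))
# Swap the roles: w(e:h) = log p(e=1|h=1) / p(e=1|h=0)
w_eh = math.log((joint[(1, 1)] / p_h[1]) / (joint[(1, 0)] / p_h[0]))

print(pmi(1, 1))   # same whichever way you order the arguments
print(w_he, w_eh)  # generally different: WOE is not symmetric
```

With these numbers, $w(h:e) = \log 3 \approx 1.10$ while $w(e:h) = \log 2.25 \approx 0.81$, whereas the PMI is the same in either direction.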
Other than that, WOE and PMI have similarities.
The weight of evidence says how much the evidence speaks in favor of a hypothesis. If it is 0, it means that it neither speaks for nor against. The higher it is, the more it validates hypothesis $h$, and the lower it is, the more it validates $\bar{h}$.
Mutual information quantifies how the occurrence of an event ($e$ or $h$) says something about the occurrence of the other event. If it is 0, the events are independent and the occurrence of one says nothing about the other. The higher it is the more often they co-occur, and the lower it is the more they are mutually exclusive.
What about the cases where the hypothesis $h$ is also a random variable, so that both quantities are well defined? For example, in communication over a binary noisy channel, the hypothesis $h$ is the emitted signal to decode and the evidence is the received signal. Say the probability of flipping is $1/1000$; then if you receive a $1$, the WOE for $1$ is $\log(0.999/0.001) = 6.90$. The PMI, on the other hand, depends on the probability of emitting a $1$. You can verify that as the probability of emitting a $1$ tends to $0$, the PMI tends to $6.90$, while it tends to $0$ as the probability of emitting a $1$ tends to $1$.
This paradoxical behavior illustrates two things:
Neither of them is suitable for making a guess about the emission. If the probability of emitting a $1$ drops below $1/1000$, the most likely emission is $0$ even when a $1$ is received; yet for small probabilities of emitting a $1$, both WOE and PMI are close to $6.90$.
PMI is a gain of (Shannon) information about the realization of the hypothesis: if the hypothesis is almost sure, no information is gained. WOE is an update of our prior odds, which does not depend on the value of those odds.
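The channel example above is easy to check numerically. A minimal sketch, using natural logarithms and the flip probability $1/1000$ from the example:

```python
import math

FLIP = 0.001  # channel flip probability from the example

def woe_for_one():
    # WOE for h=1 when a 1 is received:
    # log p(r=1 | sent=1) / p(r=1 | sent=0) -- independent of the prior.
    return math.log((1 - FLIP) / FLIP)

def pmi_for_one(q):
    # q = prior probability of emitting a 1.
    # pmi(r=1, s=1) = log p(r=1 | s=1) / p(r=1), which depends on q.
    p_r1 = q * (1 - FLIP) + (1 - q) * FLIP
    return math.log((1 - FLIP) / p_r1)

print(woe_for_one())          # ~6.91 regardless of the prior
print(pmi_for_one(1e-9))      # ~6.91: rare emissions, PMI close to the WOE
print(pmi_for_one(1 - 1e-9))  # ~0: near-certain emission, no information gained
```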
Mutual information $I(X, Y)$ can be thought of as a measure of the reduction in uncertainty about $X$ after observing $Y$:
$$ I(X, Y) = H(X) - H(X|Y)$$
where $H(X)$ is entropy of $X$ and $H(X|Y)$ is conditional entropy of $X$ given $Y$. By symmetry it follows that
$$ I(X, Y) = H(Y) - H(Y|X)$$
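Both decompositions can be verified numerically on a toy joint distribution. The table below is an arbitrary example chosen for illustration:

```python
import math

# Toy joint distribution p(x, y); the values are arbitrary but sum to 1.
joint = {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.375, (1, 1): 0.125}

px = {x: sum(v for (xx, _), v in joint.items() if xx == x) for x in (0, 1)}
py = {y: sum(v for (_, yy), v in joint.items() if yy == y) for y in (0, 1)}

def H(p):
    """Entropy in bits of a discrete distribution given as {value: prob}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def H_cond(joint, marg, given_second):
    """H(A|B) = -sum_{a,b} p(a,b) log p(a,b)/p(b)."""
    total = 0.0
    for (x, y), v in joint.items():
        b = y if given_second else x
        total -= v * math.log2(v / marg[b])
    return total

I1 = H(px) - H_cond(joint, py, given_second=True)   # H(X) - H(X|Y)
I2 = H(py) - H_cond(joint, px, given_second=False)  # H(Y) - H(Y|X)
print(I1, I2)  # equal, as the symmetry of mutual information requires
```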
However, the mutual information of a variable with itself is equal to the entropy of that variable,
$$ I(X, X) = H(X)$$
and is called self-information. This holds because $H(X|Y) = 0$ whenever the values of $X$ are completely determined by the values of $Y$, which is certainly the case for $H(X|X)$: entropy is a measure of uncertainty, and there is no uncertainty about the values of $X$ given the values of $X$, so
$$ H(X) - H(X|X) = H(X) - 0 = H(X) $$
This is immediately obvious if you think of it in terms of Venn diagrams as illustrated below.
You can also show this using the formula for mutual information and substituting the conditional entropy part, i.e.
$$ H(X|Y) = -\sum_{x \in X, y \in Y} p(x, y) \log \frac{p(x,y)}{p(y)} $$
by changing the $y$'s into $x$'s, recalling that $X \cap X = X$, so $p(x, x) = p(x)$ and each logarithm becomes $\log 1 = 0$. [Notice that this is an informal argument, since for continuous variables $p(x, x)$ would not have a density function, while still having a cumulative distribution function.]
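The substitution argument is easy to replicate numerically: put all the mass of the "joint" distribution of $(X, X)$ on the diagonal and compute $H(X|X)$ and $I(X, X)$ directly. A small sketch with an arbitrary three-point distribution:

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}  # arbitrary example distribution

def H(dist):
    """Entropy in bits."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

# Joint distribution of (X, X): all mass sits on the diagonal, p(x, x) = p(x).
joint = {(x, x): v for x, v in p.items()}

# H(X|X) = -sum_x p(x,x) log p(x,x)/p(x) = -sum_x p(x) log 1 = 0
H_x_given_x = -sum(v * math.log2(v / p[x]) for (x, _), v in joint.items())

I_xx = H(p) - H_x_given_x  # I(X, X) = H(X) - H(X|X)
print(H_x_given_x, I_xx)   # 0 and H(X) = 1.5 bits for this distribution
```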
So yes, if you know something about $X$, then learning again about $X$ gives you no more information.
Check Chapter 2 of Elements of Information Theory by Cover and Thomas, or Shannon's original 1948 paper itself, to learn more.
As for your second question: it is a common problem that your data does not contain some values that could possibly occur. In this case the classical estimator of the probability, i.e.
$$ \hat p = \frac{n_i}{\sum_i n_i} $$
where $n_i$ is the number of occurrences of the $i$th value (out of $d$ categories), gives $\hat p = 0$ if $n_i = 0$. This is called the zero-frequency problem. The easy and commonly applied fix is, as your professor told you, to add some constant $\beta$ to your counts, so that
$$ \hat p = \frac{n_i + \beta}{(\sum_i n_i) + d\beta} $$
The common choices for $\beta$ are $1$, i.e. a uniform prior based on Laplace's rule of succession; $1/2$ for the Krichevsky-Trofimov estimate; or $1/d$ for the Schurmann-Grassberger (1996) estimator. Notice, however, that what you are doing here is applying out-of-data (prior) information in your model, so it acquires a subjective, Bayesian flavor. When using this approach you have to remember the assumptions you made and take them into consideration.
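The smoothed estimator above is a one-liner in practice. A minimal sketch (function name is my own):

```python
from collections import Counter

def smoothed_probs(counts, d, beta=1.0):
    """Additive smoothing: p_i = (n_i + beta) / (N + d*beta).

    counts: mapping from category index to observed count;
    d: total number of categories, including unobserved ones;
    beta=1 is Laplace, 1/2 Krichevsky-Trofimov, 1/d Schurmann-Grassberger.
    """
    n = sum(counts.values())
    return {i: (counts.get(i, 0) + beta) / (n + d * beta) for i in range(d)}

# 3 categories, one of which never appears in the sample of size 5.
counts = Counter([0, 0, 1, 1, 1])           # category 2 is unobserved
print(smoothed_probs(counts, d=3, beta=1))  # category 2 gets (0+1)/(5+3) = 0.125
```

Note that the estimates always sum to $1$ and no category is ever assigned zero probability, which is exactly what downstream entropy or PMI computations need.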
This approach is commonly used, e.g. in the R entropy package. You can find some further information in the following paper:
Schurmann, T., and P. Grassberger. (1996). Entropy estimation of symbol sequences. Chaos, 6, 414-427.
From Wikipedia entry on pointwise mutual information:
Why does it happen? Well, the definition of pointwise mutual information is
$$ pmi \equiv \log \left[ \frac{p(x,y)}{p(x)p(y)} \right] = \log p(x,y) - \log p(x) - \log p(y), $$
whereas normalized pointwise mutual information is:
$$ npmi \equiv \frac{pmi}{-\log p(x,y)} = \frac{\log[ p(x) p(y)]}{\log p(x,y)} - 1. $$
Thus, when $x$ and $y$ always occur together, so that $p(x,y) = p(x) = p(y)$, npmi attains its maximum of $\log[p(x)^2]/\log p(x) - 1 = 1$; as co-occurrences become vanishingly rare ($p(x,y) \to 0$ with the marginals fixed), npmi tends to its minimum of $-1$; and under independence, pmi $= 0$ and hence npmi $= 0$.
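The npmi formula above is easy to sanity-check numerically (natural logs here; the base cancels in the ratio):

```python
import math

def pmi(pxy, px, py):
    """Pointwise mutual information: log p(x,y) / (p(x) p(y))."""
    return math.log(pxy / (px * py))

def npmi(pxy, px, py):
    """Normalized PMI: pmi / (-log p(x,y)), bounded in [-1, 1]."""
    return pmi(pxy, px, py) / (-math.log(pxy))

# Algebraic identity: npmi = log[p(x) p(y)] / log p(x,y) - 1
pxy, px, py = 0.1, 0.25, 0.2
print(npmi(pxy, px, py) - (math.log(px * py) / math.log(pxy) - 1))  # ~0

# Perfect co-occurrence, p(x,y) = p(x) = p(y): npmi = 1
print(npmi(0.3, 0.3, 0.3))

# Nearly never co-occurring: npmi approaches -1 as p(x,y) -> 0
print(npmi(1e-300, 0.5, 0.5))
```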