Solved: In Kneser-Ney smoothing, how are unseen words handled?

language-models, machine-learning, naive-bayes, natural-language, smoothing

From what I have seen, the (second-order) Kneser-Ney smoothing formula is, in one form or another, given as

$
\begin{align}
P^2_{KN}(w_n|w_{n-1}) &= \frac{\max \left\{ C\left(w_{n-1}, w_n\right) - D, 0\right\}}{\sum_{w'} C\left(w_{n-1}, w'\right)} + \lambda(w_{n-1}) \times P_{cont}(w_n)
\end{align}
$

with the normalizing factor $\lambda(w_{n-1})$ given as

$
\begin{align}
\lambda(w_{n-1}) &= \frac{D}{\sum_{w'} C\left(w_{n-1}, w'\right)} \times N_{1+}\left(w_{n-1}\bullet\right)
\end{align}
$

and the continuation probability $P_{cont}(w_n)$ of a word $w_n$ as

$
\begin{align}
P_{cont}(w_n) &= \frac{N_{1+}\left(\bullet w_{n}\right)}{\sum_{w'} N_{1+}\left(\bullet w'\right)}
\end{align}
$

where $N_{1+}\left(\bullet w\right)$ is the number of contexts $w$ was seen in or, more simply, the number of distinct words $\bullet$ that precede the given word $w$. From what I've understood, the formula can be applied recursively.
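To make the moving parts concrete, here is a minimal Python sketch of the bigram case; the toy corpus, the discount value $D = 0.75$, and the function and variable names (`kn_bigram_prob`, `continuation_prob`, `bigram_counts`) are my own illustrative choices, not anything prescribed by the formulas above.

```python
from collections import Counter

# Toy corpus; the corpus, the names, and D = 0.75 are illustrative assumptions.
tokens = "the cat sat on the mat the cat ate the fish".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

D = 0.75  # absolute discount

def continuation_prob(w):
    # P_cont(w) = N_1+(. w) / sum_w' N_1+(. w'): fraction of bigram types ending in w
    n_preceding = len({u for (u, v) in bigram_counts if v == w})   # N_1+(. w)
    total_types = len(bigram_counts)                               # sum_w' N_1+(. w')
    return n_preceding / total_types

def kn_bigram_prob(w_prev, w):
    # P^2_KN(w | w_prev), following the three formulas above
    denom = sum(c for (u, _), c in bigram_counts.items() if u == w_prev)  # sum_w' C(w_prev, w')
    discounted = max(bigram_counts[(w_prev, w)] - D, 0) / denom
    n_followers = len({v for (u, v) in bigram_counts if u == w_prev})     # N_1+(w_prev .)
    lam = D / denom * n_followers                                         # lambda(w_prev)
    return discounted + lam * continuation_prob(w)

print(kn_bigram_prob("the", "cat"))  # seen bigram
print(kn_bigram_prob("cat", "the"))  # unseen bigram of known words: falls back on P_cont
```

Note that `kn_bigram_prob` divides by zero as soon as `w_prev` never occurs as the first word of any training bigram, which is exactly the kind of situation the rest of the question is about.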

Now this handles known words in unknown contexts nicely for different n-gram lengths, but what it doesn't explain is what to do when there are out-of-dictionary words. I tried following this example, which states that in the recursion step for unigrams, $P_{cont}(\cdot) = P^0_{KN}(\cdot) = \frac{1}{V}$. The document then uses this, quoting Chen and Goodman, to justify the above formula as $P^1_{KN}(w) = P_{cont}(w)$.

I fail to see how this works out in the presence of an unknown word $w = \text{unknown}$, though. In that case $P_{cont}(\text{unknown}) = \frac{0}{\text{something}}$ since, obviously, the unknown word doesn't continue anything in the training set. Likewise, the count of n-grams is going to be $C\left(w_{n-1}, \text{unknown}\right) = 0$.

Furthermore, the whole $\sum_{w'} C\left(w_{n-1}, w'\right)$ term might be zero if a sequence of unknown words (say, a trigram of out-of-dictionary words) is encountered.
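As a quick check on that reasoning, here is a tiny, purely illustrative snippet that just computes the counts the formulas would need for an out-of-dictionary token; `"zorp"` is an invented OOV word and the toy corpus is the same kind of assumption as in the sketch above.

```python
from collections import Counter

# "zorp" is an invented out-of-dictionary token; the toy corpus is an assumption.
tokens = "the cat sat on the mat the cat ate the fish".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

oov = "zorp"
print(len({u for (u, v) in bigram_counts if v == oov}))           # 0 -> numerator of P_cont(zorp) is zero
print(sum(c for (u, _), c in bigram_counts.items() if u == oov))  # 0 -> sum_w' C(zorp, w') is zero as well
```

Both quantities come out as zero, matching the two problems described above.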

What am I missing?

Best Answer

Dan Jurafsky and Jim Martin have published a chapter on N-Gram models which talks a bit about this problem:

At the termination of the recursion, unigrams are interpolated with the uniform distribution:

$ \begin{align} P_{KN}(w) = \frac{\max(c_{KN}(w)-d,0)}{\sum_{w'}c_{KN}(w')}+\lambda(\epsilon)\frac{1}{|V|} \end{align} $

If we want to include an unknown word <UNK>, it’s just included as a regular vocabulary entry with count zero, and hence its probability will be:

$ \begin{align} \frac{\lambda(\epsilon)}{|V|} \end{align} $

I've tried to find out what this means, but am not sure whether $\epsilon$ just means $\lim_{x\rightarrow0}x$. If that is the case, and you assume that $\lambda(\epsilon)$ goes to $d$ as the count goes to zero, according to:

$ \begin{align} \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\vert\{w:c(w_{i-1},w)>0\}\vert \end{align} $

then the unknown word just gets assigned a fraction of the discount, i.e.:

$ \begin{align} \frac{\lambda(\epsilon)}{|V|} = \frac{d}{|V|} \end{align} $
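To put a number on that guess: with, say, a typical discount of $d = 0.75$ and a vocabulary of $|V| = 10{,}000$ types, the unknown word would end up with a probability of $0.75/10000 = 7.5 \times 10^{-5}$.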

I'm not confident about this answer at all, but wanted to get it out there in case it sparks some more thoughts.

Update: Digging around some more, it seems like $\epsilon$ is typically used to denote the empty string (""), but it's still not clear how this affects the calculation of $\lambda$. $\frac{d}{|V|}$ is still my best guess.
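For completeness, here is a minimal sketch of that terminating unigram step under the empty-string reading, i.e. treating $\epsilon$ as the empty (length-zero) context so that $\lambda(\epsilon)$ comes from applying the same $\lambda$ formula to the unigram counts, with <UNK> included as a regular vocabulary entry with count zero, as in the quote. The toy counts, the discount $d = 0.75$, and all the names are my own assumptions, so treat this as one possible reading rather than a reference implementation.

```python
from collections import Counter

# Sketch only: the toy counts, d = 0.75, and the empty-context reading of eps are assumptions.
tokens = "the cat sat on the mat the cat ate the fish".split()
unigram_counts = Counter(tokens)
unigram_counts["<UNK>"] = 0  # <UNK> as a regular vocabulary entry with count zero

d = 0.75
total = sum(unigram_counts.values())                           # sum_w' c_KN(w')
seen_types = sum(1 for c in unigram_counts.values() if c > 0)  # |{w : c(eps, w) > 0}|
lam_eps = d / total * seen_types                               # lambda(eps) via the lambda formula above
V = len(unigram_counts)

def p_kn_unigram(w):
    return max(unigram_counts[w] - d, 0) / total + lam_eps / V

print(p_kn_unigram("the"))                            # a seen word
print(p_kn_unigram("<UNK>"), lam_eps / V)             # the unknown word gets exactly lambda(eps) / |V|
print(sum(p_kn_unigram(w) for w in unigram_counts))   # ~1.0, so lambda(eps) acts as the normalizer
```

Under that particular reading $\lambda(\epsilon)$ is determined by the unigram counts rather than being exactly $d$, but either way the unknown word's probability comes out as $\lambda(\epsilon)/|V|$, consistent with the quoted passage.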