Dan Jurafsky and Jim Martin have published a chapter on N-Gram models which talks a bit about this problem:
At the termination of the recursion, unigrams are interpolated with the uniform distribution:
$
\begin{align}
P_{KN}(w) = \frac{\max(c_{KN}(w)-d,0)}{\sum_{w'}c_{KN}(w')}+\lambda(\epsilon)\frac{1}{|V|}
\end{align}
$
If we want to include an unknown word <UNK>, it's just included as a regular vocabulary entry with count zero, and hence its probability will be:
$
\begin{align}
\frac{\lambda(\epsilon)}{|V|}
\end{align}
$
I've tried to find out what this means, but am not sure if $\epsilon$ just means $\lim_{x\rightarrow0}x$. If this is the case, and you assume that as the count goes to zero, maybe $\lambda(\epsilon)$ goes to $d$, according to:
$
\begin{align}
\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\vert\{w:c(w_{i-1},w)>0\}\vert
\end{align}
$
then the unknown word just gets assigned a fraction of the discount, i.e.:
$
\begin{align}
\frac{\lambda(\epsilon)}{|V|} = \frac{d}{|V|}
\end{align}
$
I'm not confident about this answer at all, but wanted to get it out there in case it sparks some more thoughts.
Update:
Digging around some more, it seems like $\epsilon$ is typically used to denote the empty string (""), but it's still not clear how this affects the calculation of $\lambda$. $\frac{d}{|V|}$ is still my best guess.
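For concreteness, here is the unigram equation from the chapter written as a small sketch, with $\lambda(\epsilon)$ deliberately left as a free parameter since that is exactly the quantity I'm unsure about (the counts are made up):

```python
def p_kn_unigram(word, counts, d, lambda_eps, vocab_size):
    """Equation above: P_KN(w) = max(c_KN(w) - d, 0) / sum_w' c_KN(w') + lambda(eps)/|V|."""
    total = sum(counts.values())
    return max(counts.get(word, 0) - d, 0) / total + lambda_eps / vocab_size

# An unknown word has count zero, so it only keeps the lambda(eps)/|V| term.
# Under my guess lambda_eps = d, that term would be d/|V|:
counts = {"the": 3, "cat": 1, "sat": 1, "<UNK>": 0}
print(p_kn_unigram("<UNK>", counts, d=0.75, lambda_eps=0.75, vocab_size=len(counts)))
```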
Short answer: although it's possible to use it in this strange way, Kneser-Ney is not designed for smoothing unigrams, because in this case it's nothing but plain absolute discounting: $p_{abs}\left ( w_{i} \right )=\frac{\max\left ( c\left ( w_{i} \right )-\delta ,0 \right )}{\sum_{w'}c(w')}$. This looks similar to Laplace (additive) smoothing, and it is a well-known fact that such simple smoothing performs poorly.
Good-Turing smoothing provides a better scheme. The idea is to reallocate the probability mass of n-grams that occur $r + 1$ times in the training data to the n-grams that occur $r$ times. In particular, reallocate the probability mass of n-grams that were seen once to the n-grams that were never seen.
For each count $r$, we compute an adjusted count $r^{*}=(r+1)\frac{n_{r+1}}{n_{r}}$, where $n_{r}$ is the number of n-grams seen exactly $r$ times.
Then we have $p(x : c(x)=r)=\frac{r^{*}}{N}$, where $N=\sum_{r=1}^{\infty } r\, n_{r}$.
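As a small sketch of how the adjusted counts can be computed (the toy counts are invented; practical implementations additionally smooth the $n_{r}$ values, which are noisy for large $r$):

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """r* = (r + 1) * n_{r+1} / n_r, where n_r is how many n-grams were seen r times."""
    n_r = Counter(ngram_counts.values())             # count-of-counts table
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if n_r.get(r + 1, 0) > 0:
            adjusted[ngram] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[ngram] = r                       # nothing seen r+1 times: keep r
    return adjusted

toy_counts = {"a b": 3, "b c": 1, "c d": 1, "d e": 2, "e f": 2}
N = sum(toy_counts.values())                          # equals the sum over r of r * n_r
print(good_turing_adjusted_counts(toy_counts), N)
```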
But many more sophisticated models have been invented since then, so you have to do your research.
Long answer: First, let's start with the problem (in case your motivation is to gain a deeper understanding of what's going on behind the statistical model).
You have some kind of probabilistic model, which is a distribution $p(e)$
over an event space $E$. You want to estimate the parameters of your model distribution $p$ from data. In principle, you might like to use maximum likelihood (ML) estimates, so that your model is $p_{ML}\left ( x \right )=\frac{c(x)}{\sum_{e}c(e)}$
But, you have insufficient data: there are many events $x$ such that $c(x)=0$, so that the ML estimate is $p_{ML}(x)=0$.
In the case of language models, those events are words that have never been seen before, and we don't want to predict their probability to be zero.
Kneser-Ney is a very creative method to overcome this problem by smoothing. It's an extension of absolute discounting with a clever way of constructing the lower-order (backoff) model. The idea behind it is simple: the lower-order model is significant only when the count is small or zero in the higher-order model, and so it should be optimized for that purpose.
Example: suppose “San Francisco” is common, but “Francisco”
occurs only after “San”. “Francisco” will get a high unigram probability, and so absolute discounting will give a high probability to “Francisco” appearing after novel bigram histories. Better to give “Francisco” a low unigram probability, because the only time it occurs is after “San”, in which case the bigram model fits well.
For the bigram case we have $p_{abs}(w_{i}|w_{i-1})=\frac{\max(c(w_{i-1}w_{i})-\delta,0)}{\sum_{w'} c(w_{i-1}w')}+\alpha \, p_{abs}(w_{i})$, from which it is easy to conclude what will happen if we have no context (i.e., only unigrams).
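Here is a rough sketch of this interpolated absolute-discounting estimate (the counts are invented, and the choice $\alpha = \delta\,\vert\{w:c(w_{i-1},w)>0\}\vert / \sum_{w'}c(w_{i-1}w')$ is an assumption, taken to match the $\lambda$ from the question rather than anything stated above):

```python
def p_abs_unigram(w, unigram_counts, delta):
    """p_abs(w) = max(c(w) - delta, 0) / sum_w' c(w')  (the unigram-only case above)."""
    total = sum(unigram_counts.values())
    return max(unigram_counts.get(w, 0) - delta, 0) / total

def p_abs_bigram(w, prev, bigram_counts, unigram_counts, delta):
    """p_abs(w | prev) = max(c(prev, w) - delta, 0) / c(prev, .) + alpha * p_abs(w)."""
    context_total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    if context_total == 0:
        return p_abs_unigram(w, unigram_counts, delta)   # unseen context: unigrams only
    discounted = max(bigram_counts.get((prev, w), 0) - delta, 0) / context_total
    # alpha: the probability mass removed by the discount, handed to the unigram model
    seen_types = sum(1 for (p, _), c in bigram_counts.items() if p == prev and c > 0)
    alpha = delta * seen_types / context_total
    return discounted + alpha * p_abs_unigram(w, unigram_counts, delta)

bigrams = {("san", "francisco"): 5, ("san", "jose"): 2, ("a", "paragraph"): 1}
unigrams = {"san": 7, "francisco": 5, "jose": 2, "a": 1, "paragraph": 1}
print(p_abs_bigram("francisco", "san", bigrams, unigrams, delta=0.75))
```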
Also take a look at the classic Chen & Goodman paper for a thorough and systematic comparison of many traditional language models: http://u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf
Best Answer
EDIT Feb 2, 2019: I figured out I made a mistake below with the continuation count parameter (and with how a wolf sounds). I corrected it, but also published a new version in this link. If you find it useful, please follow it there.
I got here willing to find an answer but got no luck :( But then, I had to figure things out on my own! So let me try to explain for those who might seek this kind of information in the future. (Actually, this is a very tentative reply, I am not absolutely sure of my conclusions, but I hope others can build from there).
Suppose we start from the following text (from here):
After turning it into a corpus, tokenizing (punctuation removed, lowercased) and n-gramming (1, 2 and 3 n-grams), we get to the following tables:
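Roughly, the tables can be built with something like the following sketch (the toy sentence is a stand-in, not the actual corpus or code behind the tables shown here):

```python
import re
from collections import Counter

def ngram_table(tokens, n):
    """Frequency table of n-grams, keyed by tuples of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "this is a paragraph . a paragraph is a unit of text ."   # stand-in, not the real corpus
tokens = re.findall(r"[a-z]+", text.lower())                      # lowercase, strip punctuation

table1Gram = ngram_table(tokens, 1)
table2Gram = ngram_table(tokens, 2)
table3Gram = ngram_table(tokens, 3)
print(table3Gram.most_common(3))
```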
Let's search for the string (and remember I am calling it the "string") `a paragraph` where it starts an n-gram in `table3Gram`. We get to the following subset:

Now our aim is to calculate `P_KN(is|a paragraph)` and `P_KN(can|a paragraph)` (equation 4.35 from the material you cited). For this, let's first decompose that gigantic equation into three parts that I will call `firstTerm`, `lambda`, and `Pcont`.

`firstTerm` is the max of (the frequency of that n-gram minus d) and zero, divided by the frequency of the string. Since this is our highest-order n-gram, we set `d = 0` (that is what equation 4.36 stands for; for more on d, please check the awesome material here and here). So, for the word `is`, `firstTerm(is) = 4/(4+1) = 0.8`, and for the word `can`, `firstTerm(can) = 1/(4+1) = 0.2`.
`lambda` is not spelled out up there, but it is the normalizing weight

$
\begin{align}
\lambda(\text{string}) = \frac{d}{c(\text{string})}\vert\{w:c(\text{string},w)>0\}\vert
\end{align}
$

In this case `d = 0` still, so `lambda = 0`, but just for illustration, let's examine the other components of the equation. The denominator is the frequency of the string, and the factor on the right is the number of different final word types (not their frequencies) that follow that string, which in our example are `is` and `can`. Note that `lambda` is a function of the string, not of the final word, so it is the same for both final words. In our case, `lambda = 0/(1+4)*2`.
Then we get to `Pcont`, which is also not spelled out up there, but stands for

$
\begin{align}
P_{cont}(w) = \frac{\vert\{s : c(s,w)>0\}\vert}{\vert \text{table3Gram} \vert}
\end{align}
$

The numerator is the number of different strings $s$ preceding the word, and the denominator is the number of different possible n-gram types (the length of `table3Gram`).

For the word `is`, we check that it can be preceded by 3 different strings:

The word `can` is preceded by only 1 string type:

Since `table3Gram` has 117 entries, `Pcont(is) = 3/117 = 0.026` and `Pcont(can) = 1/117 = 0.009`.
Pulling it all together:
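As a sanity check, here is the whole highest-order computation as one small sketch; the helper function is hypothetical and simply combines `firstTerm`, `lambda`, and `Pcont` as defined above, using the numbers from this example:

```python
def p_kn_highest(word_count, string_count, d, n_final_types, n_preceding, table_size):
    """P_KN = firstTerm + lambda * Pcont, as decomposed above."""
    first_term = max(word_count - d, 0) / string_count
    lam = d / string_count * n_final_types        # same for every final word given the string
    p_cont = n_preceding / table_size
    return first_term + lam * p_cont

# Numbers from this example: "a paragraph" occurs 4 + 1 = 5 times, is followed by
# 2 distinct word types, and table3Gram has 117 entries.
print(p_kn_highest(word_count=4, string_count=5, d=0, n_final_types=2,
                   n_preceding=3, table_size=117))   # P_KN(is  | a paragraph) = 0.8
print(p_kn_highest(word_count=1, string_count=5, d=0, n_final_types=2,
                   n_preceding=1, table_size=117))   # P_KN(can | a paragraph) = 0.2
```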
If we cannot find a match in the highest-order n-grams (for example, there is no match for `the paragraph`), then we drop down to the next lower order, discarding the first word in the string. So we should search for `paragraph` starting an n-gram in `table2Gram`, which returns:

The main difference from the first round is that we must set d to 0.75 (as recommended in the video previously mentioned); remember, that is what equation 4.36 stands for, as `continuation_count = count - d`. Then `lambda` will be different from 0 and the final probabilities will be slightly corrected.

If we drop down all the way to `table1Gram`, then we have to consider that the string preceding the final word is empty, which is the same for all n-grams in that table. If one succeeds in implementing a loop with the rules mentioned above, though, it will take care of this observation on its own; a rough sketch of such a loop follows.
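Here is one possible way to wire those rules into a single loop; it assumes the frequency tables are kept in a dict like `{1: table1Gram, 2: table2Gram, 3: table3Gram}` keyed by n-gram tuples, and it is only a sketch of the rules above, not the exact code behind the tables shown:

```python
def p_kn(word, history, tables):
    """Back off from the highest-order table to unigrams, following the rules above.
    `history` is the list of preceding words; `tables[n]` maps n-gram tuples to counts."""
    max_n = len(history) + 1
    for n in range(max_n, 0, -1):
        string = tuple(history[len(history) - (n - 1):])   # last n-1 words of the history
        subset = {g: c for g, c in tables[n].items() if g[:n - 1] == string}
        if not subset:
            continue                                       # no match for the string: drop an order
        d = 0.0 if n == max_n else 0.75                    # no discount at the highest order
        string_count = sum(subset.values())
        first_term = max(subset.get(string + (word,), 0) - d, 0) / string_count
        lam = d / string_count * len(subset)               # len(subset) = distinct final word types
        preceding = sum(1 for g in tables[n] if g[-1] == word)
        p_cont = preceding / len(tables[n])                # Pcont over all n-gram types in the table
        return first_term + lam * p_cont
    return 0.0

# Example call with the tables built earlier:
# print(p_kn("is", ["a", "paragraph"], {1: table1Gram, 2: table2Gram, 3: table3Gram}))
```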
I guess that's it! Owooooo! (that was me howling)