Dan Jurafsky and Jim Martin have published a chapter on N-Gram models which talks a bit about this problem:
At the termination of the recursion, unigrams are interpolated with the uniform distribution:
$
\begin{align}
P_{KN}(w) = \frac{\max(c_{KN}(w)-d,0)}{\sum_{w'}c_{KN}(w')}+\lambda(\epsilon)\frac{1}{|V|}
\end{align}
$
If we want to include an unknown word `<UNK>`, it's just included as a regular vocabulary entry with count zero, and hence its probability will be:
$
\begin{align}
\frac{\lambda(\epsilon)}{|V|}
\end{align}
$
I've tried to find out what this means, but I'm not sure whether $\epsilon$ just denotes $\lim_{x\rightarrow 0}x$. If that's the case, and you assume that as the count goes to zero $\lambda(\epsilon)$ goes to $d$, according to:
$
\begin{align}
\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\vert\{w:c(w_{i-1},w)>0\}\vert
\end{align}
$
then the unknown word just gets assigned a fraction of the discount, i.e.:
$
\begin{align}
\frac{\lambda(\epsilon)}{|V|} = \frac{d}{|V|}
\end{align}
$
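To sanity-check this reading, here is a small sketch (on a made-up toy corpus) of the interpolated unigram formula, assuming the empty-context weight is $\lambda(\epsilon) = \frac{d}{N}\,\vert\{w : c(w) > 0\}\vert$, i.e. the $\lambda$ formula above applied with $c(\epsilon) = N$ (an assumption, not something the chapter states explicitly). With that choice the distribution sums to 1 and `<UNK>` receives exactly $\lambda(\epsilon)/|V|$:

```python
from collections import Counter

def unigram_kn(counts, vocab, d=0.75):
    """Interpolated unigram distribution with uniform backoff:
    P(w) = max(c(w) - d, 0) / N + lambda_eps / |V|,
    where lambda_eps = d * |{w : c(w) > 0}| / N  (assumed empty-context
    lambda), chosen so the probabilities sum to 1."""
    total = sum(counts.values())                      # N
    n_seen = sum(1 for w in vocab if counts[w] > 0)   # types with c(w) > 0
    lam = d * n_seen / total                          # mass freed by discounting
    return {w: max(counts[w] - d, 0) / total + lam / len(vocab) for w in vocab}

# toy corpus; "<UNK>" enters the vocabulary with count zero
counts = Counter("the cat sat on the mat".split())
vocab = set(counts) | {"<UNK>"}
p = unigram_kn(counts, vocab)

# "<UNK>" gets exactly lambda(eps) / |V|, and the distribution is normalized
```

Note that `<UNK>`'s probability here depends on the number of seen word types, not just on $d$, which is one reason $\frac{d}{|V|}$ may only be an approximation.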
I'm not confident about this answer at all, but wanted to get it out there in case it sparks some more thoughts.
Update:
Digging around some more, it seems that $\epsilon$ is typically used to denote the empty string (""), but it's still not clear how this affects the calculation of $\lambda$. $\frac{d}{|V|}$ is still my best guess.
Best Answer
Short answer: although it's possible to use it in this strange way, Kneser-Ney is not designed for smoothing unigrams, because in that case it degenerates into plain absolute discounting: $p_{abs}\left ( w_{i} \right )=\frac{\max\left ( c\left ( w_{i} \right )-\delta ,0 \right )}{\sum_{w'}c(w')}$. This is as simplistic as Laplace (additive) smoothing, and it is a well-known fact that additive smoothing performs poorly.
Good–Turing estimation offers a better scheme. The idea is to reallocate the probability mass of n-grams that occur $r + 1$ times in the training data to the n-grams that occur $r$ times; in particular, the probability mass of n-grams that were seen once is reallocated to the n-grams that were never seen.
For each count $r$, we compute an adjusted count $r^{*}=(r+1)\frac{n_{r+1}}{n_{r}}$, where $n_{r}$ is the number of n-grams seen exactly $r$ times.
Then we have: $p(x:c(x)=r)=\frac{r^{*}}{N}$, where $N=\sum_{r=1}^{\infty }r\,n_{r}$. Many more sophisticated models have been invented since then, so you have to do your own research.
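As a concrete illustration of the adjusted counts, here is a toy sketch (real Good–Turing implementations first smooth the $n_r$ values, since $n_{r+1}$ may be zero for large $r$; this sketch just falls back to the raw count in that case):

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Adjusted counts r* = (r + 1) * n_{r+1} / n_r, where n_r is the
    number of n-grams seen exactly r times. Toy sketch only: falls back
    to the raw count r when n_{r+1} = 0 instead of smoothing the n_r."""
    freq_of_freq = Counter(counts.values())          # n_r
    adjusted = {}
    for ngram, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq[r + 1]
        adjusted[ngram] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted

counts = Counter(["a", "a", "a", "b", "b", "c", "d", "e"])
# n_1 = 3 (c, d, e), n_2 = 1 (b), n_3 = 1 (a)
adj = good_turing_adjusted(counts)
# each singleton's count is adjusted down to 2 * n_2 / n_1 = 2/3
```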
Long answer: First, let's start with the problem (if your motivation is to gain a deeper understanding of what's going on behind the statistical model).
You have some kind of probabilistic model, which is a distribution $p(e)$ over an event space $E$, and you want to estimate the parameters of your model distribution $p$ from data. In principle, you might like to use maximum likelihood (ML) estimates, so that your model is $p_{ML}\left ( x \right )=\frac{c(x)}{\sum_{e}c(e)}$.
But you have insufficient data: there are many events $x$ such that $c(x)=0$, so the ML estimate is $p_{ML}(x)=0$. In the case of language models, those events are words that were never seen before, and we don't want to predict their probability to be zero.
Kneser-Ney is a very creative method to overcome this problem by smoothing. It's an extension of absolute discounting with a clever way of constructing the lower-order (backoff) model. The idea behind it is simple: the lower-order model matters only when the count in the higher-order model is small or zero, and so it should be optimized for that purpose.
Example: suppose “San Francisco” is common, but “Francisco” occurs only after “San”. “Francisco” will get a high unigram probability, and so absolute discounting will give a high probability to “Francisco” appearing after novel bigram histories. Better to give “Francisco” a low unigram probability, because the only time it occurs is after “San”, in which case the bigram model fits well.
For the bigram case we have: $p_{abs}(w_{i}|w_{i-1})=\frac{\max(c(w_{i-1}w_{i})-\delta,0)}{\sum_{w'} c(w_{i-1}w')}+\alpha \, p_{abs}(w_{i})$, from which it is easy to conclude what happens if we have no context (i.e. only unigrams).
Also take a look at the classic Chen & Goodman paper for a thorough and systematic comparison of many traditional language models: http://u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf