Solved – Kneser-Ney for unigrams

language-models laplace-smoothing natural-language smoothing

I was wondering if it is at all possible to use Kneser-Ney to smooth word unigram probabilities?

The basic idea behind back-off is to use (n-1)-gram frequencies when an n-gram has a count of 0. This is obviously hard to do with a unigram. Is there something I am missing that would allow Kneser-Ney to be used for unigrams to smooth probabilities of single words? If this is possible, how could it be done? If not, why is it impossible?

Best Answer

Short answer: although it is possible to use it in this unusual way, Kneser-Ney is not designed for smoothing unigrams, because in that case it degenerates into simple count discounting: $p_{abs}\left ( w_{i} \right )=\frac{\max\left ( c\left ( w_{i} \right )-\delta ,0 \right )}{\sum_{w'}c(w')}$. This looks much like Laplace (additive) smoothing, and it is a well-known fact that such simple smoothing schemes perform poorly.
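To see concretely why the unigram case degenerates, here is a minimal sketch of the formula above (the function name, toy corpus, and choice of $\delta$ are just for illustration):

```python
from collections import Counter

def unigram_absolute_discount(tokens, delta=0.75):
    """Absolute discounting applied directly to unigram counts:
    p(w) = max(c(w) - delta, 0) / total_count.
    With no lower-order model to back off to, this is just a fixed
    discount on every count, much like additive/Laplace-style smoothing."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: max(c - delta, 0) / total for w, c in counts.items()}

# toy corpus (hypothetical)
print(unigram_absolute_discount("the cat sat on the mat".split()))
```

Note that the discounted mass is simply lost here: there is no lower-order distribution to hand it to, which is exactly the problem with applying Kneser-Ney at the unigram level.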

Good–Turing smoothing offers a better scheme. The idea is to reallocate the probability mass of n-grams that occur $r + 1$ times in the training data to the n-grams that occur $r$ times; in particular, to reallocate the probability mass of n-grams that were seen once to the n-grams that were never seen.

For each count $r$, we compute an adjusted count $r^{*}=(r+1)\frac{n_{r+1}}{n_{r}}$, where $n_{r}$ is the number of n-grams seen exactly $r$ times.

Then we have $p(x:c(x)=r)=\frac{r^{*}}{N}$, where $N=\sum_{r=1}^{\infty } r\, n_{r}$. Many more sophisticated models have been invented since then, so it is worth doing your own research.
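A small sketch of the adjusted-count computation above, assuming a plain token list as input; real implementations (e.g. Simple Good–Turing) smooth the count-of-counts $n_r$ first, since $n_{r+1}$ is often zero for large $r$:

```python
from collections import Counter

def good_turing_probs(tokens):
    """Raw Good-Turing: r* = (r + 1) * n_{r+1} / n_r, where n_r is the
    number of types seen exactly r times. Returns p(x) = r*/N for each
    observed count r, plus the total mass reserved for unseen events."""
    counts = Counter(tokens)
    n = Counter(counts.values())               # n[r] = number of types with count r
    N = sum(r * n_r for r, n_r in n.items())   # total number of tokens
    p = {}
    for r in n:
        r_star = (r + 1) * n.get(r + 1, 0) / n[r]
        p[r] = r_star / N
    p[0] = n.get(1, 0) / N                     # mass reserved for unseen events
    return p

print(good_turing_probs("a a a b b c d e".split()))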

Long answer: let's start with the underlying problem (if your motivation is to gain a deeper understanding of what is going on behind the statistical model).

You have some kind of probabilistic model, which is a distribution $p(e)$ over an event space $E$, and you want to estimate the parameters of your model distribution $p$ from data. In principle, you might like to use maximum likelihood (ML) estimates, so that your model is $p_{ML}\left ( x \right )=\frac{c(x)}{\sum_{e}c(e)}$.
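In code, the ML estimate is nothing more than relative frequency (a minimal sketch; the function name is only illustrative):

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum-likelihood unigram estimate: p_ML(x) = c(x) / sum_e c(e).
    Any word absent from the training data gets probability zero."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```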

But you have insufficient data: there are many events $x$ such that $c(x)=0$, so that the ML estimate is $p_{ML}(x)=0$. In the case of language models those events are words that were never seen before, and we do not want to predict their probability to be zero.

Kneser-Ney is a very creative way to overcome this problem by smoothing. It is an extension of absolute discounting with a clever way of constructing the lower-order (backoff) model. The idea behind it is simple: the lower-order model matters only when the count in the higher-order model is small or zero, and so it should be optimized for that purpose.

Example: suppose “San Francisco” is common, but “Francisco” occurs only after “San”. “Francisco” will get a high unigram probability, and so absolute discounting will give a high probability to “Francisco” appearing after novel bigram histories. Better to give “Francisco” a low unigram probability, because the only time it occurs is after “San”, in which case the bigram model fits well.
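This is the motivation for the Kneser-Ney lower-order model, which replaces the raw unigram frequency with a continuation probability based on how many distinct left contexts a word appears after. A rough sketch (one common formulation, normalizing by the number of distinct bigram types; the example data is made up):

```python
from collections import defaultdict

def continuation_probability(bigrams):
    """Kneser-Ney lower-order (continuation) unigram:
    p_cont(w) = |{w' : c(w', w) > 0}| / |{(w', w'') : c(w', w'') > 0}|
    i.e. the number of distinct words that w follows, relative to the
    total number of distinct bigram types."""
    left_contexts = defaultdict(set)
    bigram_types = set()
    for w_prev, w in bigrams:
        left_contexts[w].add(w_prev)
        bigram_types.add((w_prev, w))
    total_types = len(bigram_types)
    return {w: len(ctx) / total_types for w, ctx in left_contexts.items()}

# "Francisco" occurs often, but only ever after "San",
# so its continuation probability stays low.
bigrams = [("San", "Francisco")] * 5 + [("the", "dog"), ("a", "dog"), ("my", "dog")]
print(continuation_probability(bigrams))
```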

For the bigram case we have $p_{abs}(w_{i}|w_{i-1})=\frac{\max(c(w_{i-1}w_{i})-\delta,0)}{\sum_{w'} c(w_{i-1}w')}+\alpha\, p_{abs}(w_{i})$, from which it is easy to see what happens if we have no context (i.e. only unigrams).
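As a sketch of the bigram formula above, here is interpolated absolute discounting with the usual choice of $\alpha$ (the normalizer that makes each conditional distribution sum to one). The token list and discount value are only illustrative, and full Kneser-Ney would additionally replace $p_{abs}(w_i)$ with the continuation probability:

```python
from collections import Counter, defaultdict

def absolute_discount_bigram(tokens, delta=0.75):
    """Interpolated absolute discounting:
    p(w_i | w_{i-1}) = max(c(w_{i-1} w_i) - delta, 0) / c(w_{i-1} *)
                       + alpha(w_{i-1}) * p_uni(w_i)
    with alpha(w_{i-1}) = delta * |{w' : c(w_{i-1} w') > 0}| / c(w_{i-1} *),
    the standard choice that makes each conditional distribution sum to 1."""
    unigram = Counter(tokens)
    total = sum(unigram.values())
    p_uni = {w: c / total for w, c in unigram.items()}

    bigram = Counter(zip(tokens, tokens[1:]))
    context_total = Counter(tokens[:-1])        # c(w_{i-1} *)
    followers = defaultdict(set)
    for w_prev, w in bigram:
        followers[w_prev].add(w)

    def prob(w_prev, w):
        c_ctx = context_total[w_prev]
        if c_ctx == 0:
            return p_uni.get(w, 0.0)            # unseen context: back off fully
        discounted = max(bigram[(w_prev, w)] - delta, 0) / c_ctx
        alpha = delta * len(followers[w_prev]) / c_ctx
        return discounted + alpha * p_uni.get(w, 0.0)

    return prob

p = absolute_discount_bigram("the cat sat on the mat and the cat ran".split())
print(p("the", "cat"), p("the", "dog"))
```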

Also take a look at the classic Chen & Goodman paper for a thorough and systematic comparison of many traditional language models: http://u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf
