Solved – A simple numerical example for Kneser-Ney Smoothing

language-models, laplace-smoothing, naive-bayes, natural-language, smoothing

I'm working on a project trying to implement the Kneser-Ney algorithm. I think I got as far as implementing this formula for bigrams:

$P_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta, 0)}{\sum_{w'}{c(w_{i-1}, w')}} + \lambda_{w_{i-1}}P_{continuation}(w_i)$

But this is not the final formula. The final formula includes a recursion, where you consider more than two n-gram levels. I intend to use 4-grams down to unigrams in my code. So, the formula with the recursion is as follows:

$P_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{\max(c_{KN}(w_{i-n+1}^{i}) - d, 0)}{\sum_{v}{c_{KN}(w_{i-n+1}^{i-1}\,v)}} + \lambda(w_{i-n+1}^{i-1})\,P_{KN}(w_i|w_{i-n+2}^{i-1})$

where $c_{KN}(\cdot)$ is the plain count for the highest-order n-gram and the continuation count for the lower orders.

Source: https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf

Can anyone work through a simple example based on a small corpus?
I need to compute it by hand to understand it.

Best Answer

EDIT Feb 2, 2019: I realized I made a mistake below with the continuation count parameter (and with how a wolf sounds). I corrected it and published a new version at this link. If you find it useful, please follow it there.

I got here hoping to find an answer but had no luck :( So I had to figure things out on my own! Let me try to explain for those who might seek this kind of information in the future. (Actually, this is a very tentative reply; I am not absolutely sure of my conclusions, but I hope others can build from here.)

Suppose we start from the following text (from here):

exampleText = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc. In reality, though, the unity and coherence of ideas among sentences is what constitutes a paragraph. A paragraph is defined as “a group of sentences or a single sentence that forms a unit”. Length and appearance do not determine whether a section in a paper is a paragraph. For instance, in some styles of writing, particularly journalistic styles, a paragraph can be just one sentence long. Ultimately, a paragraph is a sentence or group of sentences that support one main idea. In this handout, we will refer to this as the “controlling idea,” because it controls what happens in the rest of the paragraph."

After turning it into a corpus, tokenizing (punctuation removed, text lowercased) and extracting n-grams (n = 1, 2, and 3), we get the following tables (a short code sketch for reproducing them follows the third table):

# table1Gram:
           ngram frequency
 1:            a        15
 2:           of         8
 3:    paragraph         8
 4:           in         6
 5:           is         6
---
72:      because         1
73:           it         1
74:     controls         1
75:      happens         1
76:         rest         1

# table2Gram:
             ngram frequency
  1:   a paragraph         7
  2:  paragraph is         4
  3:          is a         3
  4:      group of         3
  5:       a group         2
 ---                        
111:        in the         1
112:      the rest         1
113:       rest of         1
114:        of the         1
115: the paragraph         1

# table3Gram:
                  ngram frequency
  1:     a paragraph is         4
  2:     paragraph is a         2
  3:         a group of         2
  4: group of sentences         2
  5: paragraphs are the         1
 ---                             
113:     happens in the         1
114:        in the rest         1
115:        the rest of         1
116:        rest of the         1
117:   of the paragraph         1
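
For anyone wanting to reproduce these tables, here is a minimal Python sketch of the preprocessing step (the names are mine, and I assume simple regex tokenization; your exact counts may differ slightly depending on tokenization details):

import re
from collections import Counter

# Paste the full exampleText from above; shortened here for space.
exampleText = "Paragraphs are the building blocks of papers."

# Lowercase and keep only alphabetic tokens (this drops punctuation).
tokens = re.findall(r"[a-z]+", exampleText.lower())

def ngram_table(tokens, n):
    """Count n-grams as a Counter keyed by tuples of words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

table1Gram = ngram_table(tokens, 1)
table2Gram = ngram_table(tokens, 2)
table3Gram = ngram_table(tokens, 3)

print(table1Gram.most_common(3))  # with the full text: ('a',) 15, ('of',) 8, ...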

Let's search for the string a paragraph (I will call the n-gram minus its final word the "string") at the start of an n-gram in table3Gram. We get the following subset, which in code is just a prefix filter (sketched right after the table):

                     ngram frequency
1:          a paragraph is         4
2:         a paragraph can         1
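
A sketch of that filter (the helper name is mine; the excerpt table is hard-coded so the snippet runs on its own):

# Trigrams whose first two words equal the string; maps final word -> count.
def continuations(table, string):
    return {ng[-1]: c for ng, c in table.items() if ng[:-1] == string}

table3Gram = {("a", "paragraph", "is"): 4,   # tiny excerpt of the full table
              ("a", "paragraph", "can"): 1,
              ("a", "paper", "is"): 1}
print(continuations(table3Gram, ("a", "paragraph")))  # {'is': 4, 'can': 1}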

Now our aim is to calculate P_KN(is|a paragraph) and P_KN(can|a paragraph) (equation 4.35 from the material you cited).

For this, let's first decompose that gigantic equation into three parts, which I will call firstTerm, lambda, and Pcont.

firstTerm is max(the frequency of that n-gram minus d, 0), divided by the frequency of the string. Since this is our highest-order n-gram, we set d = 0 (that is what equation 4.36 stands for; for more on d, please check the awesome material here and here). So, for the word is, firstTerm(is) = 4/(4+1) = 0.8, and for the word can, firstTerm(can) = 1/(4+1) = 0.2.
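
As a sketch (helper name is mine; counts come from the subset above):

def first_term(count_ngram, count_string, d):
    """max(count - d, 0) over the total frequency of the string."""
    return max(count_ngram - d, 0) / count_string

count_string = 4 + 1                      # "a paragraph is" + "a paragraph can"
print(first_term(4, count_string, d=0))   # firstTerm(is)  = 0.8
print(first_term(1, count_string, d=0))   # firstTerm(can) = 0.2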

lambda is not written out up there, but it is:

$\lambda(s) = \frac{d}{\sum_{w'}{c(s, w')}} \cdot |\{w : c(s, w) > 0\}|$

In this case d = 0 still, so lambda = 0, but just for illustration, let's examine the other components of the equation. The denominator is the frequency of the string $s$, and the factor on the right is the number of distinct final word types (not their frequencies) given that string, which in our example are is and can. Note that lambda is a function of the string, not of the final word, so it is the same for both final words. In our case, lambda = 0/(4+1) * 2 = 0.
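
A sketch of that computation (names are mine; the second call just shows what lambda would be with a non-zero discount):

def lambda_string(d, count_string, n_final_word_types):
    """d over the string's frequency, times the number of final word types."""
    return (d / count_string) * n_final_word_types

print(lambda_string(0, 4 + 1, 2))      # 0.0, as in the text
print(lambda_string(0.75, 4 + 1, 2))   # 0.3, if d were 0.75 instead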

Then we get to Pcont, which is also not spelled out up there, but stands for:

$P_{cont}(w) = \frac{|\{s : c(s, w) > 0\}|}{|\{(s', w') : c(s', w') > 0\}|}$

The numerator is the number of distinct strings preceding the word, and the denominator is the number of distinct possible n-gram types (the length of table3Gram).

For the word is, we check that it can be preceded by 3 different strings:

                ngram frequency
1:     a paragraph is         4
2:         a paper is         1
3: among sentences is         1

The word can is preceded only by 1 string type:

    ngram frequency
1: a paragraph can         1

Since table3Gram has 117 entries, Pcont(is) = 3/117 ≈ 0.026 and Pcont(can) = 1/117 ≈ 0.009.
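
As a sketch, counting the preceding strings directly from the trigram table (hard-coded excerpt so it runs standalone; the 117 is the table size from above):

def p_cont(word, table, n_ngram_types):
    """Distinct strings preceding `word`, over the number of n-gram types."""
    preceding = {ng[:-1] for ng in table if ng[-1] == word}
    return len(preceding) / n_ngram_types

table3Gram = {("a", "paragraph", "is"): 4, ("a", "paper", "is"): 1,
              ("among", "sentences", "is"): 1, ("a", "paragraph", "can"): 1}
print(p_cont("is", table3Gram, 117))    # 3/117 ~ 0.026
print(p_cont("can", table3Gram, 117))   # 1/117 ~ 0.009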

Pulling it all together:

P(is|a paragraph) = 0.8 + 0 * 0.026 = 0.8
P(can|a paragraph) = 0.2 + 0 * 0.009 = 0.2
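
In code, the whole highest-order computation is just (same assumptions as above; with d = 0 the lambda term vanishes):

def p_kn_highest(count_ngram, count_string, n_final_word_types, pcont, d=0.0):
    first = max(count_ngram - d, 0) / count_string
    lam = (d / count_string) * n_final_word_types
    return first + lam * pcont

print(p_kn_highest(4, 5, 2, 3 / 117))   # P(is  | a paragraph) = 0.8
print(p_kn_highest(1, 5, 2, 1 / 117))   # P(can | a paragraph) = 0.2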

If we cannot find a match at the highest n-gram order (for example, there is no trigram starting with the string the paragraph), then we drop down to the next lower order, discarding the first word of the string. So we search for paragraph at the start of an n-gram in table2Gram, which returns:

            ngram frequency
1:   paragraph is         4
2: paragraphs are         1
3:  paragraphs in         1
4:  paragraph can         1

The main difference from the first round is that we must set d to 0.75 (as recommended in the video previously mentioned); remember, that is what equation 4.36 stands for, since continuation_count = count - d. Then lambda will be different from 0 and the final probabilities will be slightly adjusted.

If we drop all the way down to table1Gram, we have to consider that the string preceding the final word is empty, which is the same for all n-grams in that table. A loop implementing the rules above will take care of this case on its own, as sketched below.
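
Here is a rough sketch of such a loop (recursive, in my case), following the rules described in this answer: d = 0 at the highest order and d = 0.75 once we drop down, with Pcont computed within the current table. All names are mine, the excerpt tables are hard-coded so it runs on its own, and it is illustrative rather than production code:

def p_kn(word, string, tables, highest=True):
    """P_KN(word | string); string is a tuple, tables[i] holds (i+1)-grams."""
    table = tables[len(string)]
    matches = {ng: c for ng, c in table.items() if ng[:-1] == string}
    count_string = sum(matches.values())
    if count_string == 0:
        if not string:
            return 0.0                     # word not even in the unigram table
        # No match at this order: drop the first word and go one level down.
        return p_kn(word, string[1:], tables, highest=False)
    d = 0.0 if highest else 0.75
    first = max(matches.get(string + (word,), 0) - d, 0) / count_string
    lam = (d / count_string) * len(matches)
    pcont = sum(1 for ng in table if ng[-1] == word) / len(table)
    return first + lam * pcont

# Tiny excerpts, enough to reproduce P(is | a paragraph) = 0.8:
table1 = {("a",): 15, ("paragraph",): 8, ("is",): 6}
table2 = {("a", "paragraph"): 7, ("paragraph", "is"): 4, ("paragraph", "can"): 1}
table3 = {("a", "paragraph", "is"): 4, ("a", "paragraph", "can"): 1,
          ("a", "paper", "is"): 1, ("among", "sentences", "is"): 1}
print(p_kn("is", ("a", "paragraph"), [table1, table2, table3]))  # 0.8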

I guess that's it! Owooooo! (that was me howling)
