Solved – Understanding Add-1/Laplace smoothing with bigrams

language-models, laplace-smoothing, machine-learning, natural-language, probability

I am working through an example of Add-1 (Laplace) smoothing in the context of NLP.

Say that there is the following corpus (start and end tokens included):

+ I am sam -
+ sam I am -
+ I do not like green eggs and ham -

I want to compute the probability that a bigram model trained on that small corpus assigns to the following sentence:

+ I am sam green -

Normally, the probability would be found by:

P(i|+)*P(am|i)*P(sam|am)*P(green|sam)*P(-|green)

Which would be:

(2/3)*(2/3)*(1/2)*(0/2)*(0/1) = 0

To try to alleviate this, I would do the following:

P(W[i]|W[i-1]) = (Count(W[i-1]W[i])+1)/(Count(W[i-1])+V)

where V is the sum of the corpus counts of the types that appear in the searched sentence; in this instance:

V=Count(+)+Count(i)+Count(am)+Count(sam)+Count(green)=3+3+2+2+1=11

This turns out to be:

(3/14)*(3/14)*(2/13)*(1/13)*(1/12)~= 0.000045
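As a quick arithmetic check (keeping the V = 11 value as defined above), the product can be evaluated directly:

```python
from math import prod

# Add-1 smoothed bigram probabilities computed above, with V = 11:
# P(i|+), P(am|i), P(sam|am), P(green|sam), P(-|green)
probs = [3/14, 3/14, 2/13, 1/13, 1/12]
print(prod(probs))  # ≈ 4.53e-05
```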

Now, say I want to see the probability that the following sentence is in the small corpus:

+ I am mark Johnson -

A normal probability will be undefined (0/0).

(2/3)*(2/3)*(0/2)*(0/0)*(0/0)

Going straight to the smoothing portion:

V = Count(+)+Count(i)+Count(am)+Count(mark)+Count(johnson)=3+3+2+0+0=8

(3/11)*(3/11)*(1/10)*(1/8)*(1/8)~=0.00012

I fail to understand how this can be the case, considering "mark" and "johnson" are not even present in the corpus to begin with. Is this a special case that must be accounted for, or is it just a caveat of the add-1/Laplace smoothing method?

Do I just have the wrong value for V (i.e. should I add 1 for a non-present word, which would make V=10 to account for "mark" and "johnson")? If this is the case (it almost makes sense to me that this would be the case), then would it be the following:

(3/13)*(3/13)*(1/12)*(1/10)*(1/10)~=0.000044

Moreover, what would be done with, say, a sentence like:

+ yo soy mark johnson -

Would it be (assuming that I just add the new words to the corpus):

V=3+1+1+1
(0/3)*(1/7)*(1/7)*(1/7)*(1/7)~=0.00042

Best Answer

I know this question is old, and I'm answering for other people who may have the same question. You had the wrong value for V. V is the vocabulary size, which is the number of unique words (types) in your corpus, not a sum of token counts. Here the types are +, -, i, am, sam, do, not, like, green, eggs, and, ham, so V = 12.
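For anyone who wants to verify this, here is a short Python sketch (function names are my own) that builds the bigram counts from the three-sentence corpus, takes V as the number of types, and scores the first example sentence with add-1 smoothing:

```python
from collections import Counter
from math import prod

# Toy corpus with start (+) and end (-) tokens included.
corpus = [
    "+ i am sam -",
    "+ sam i am -",
    "+ i do not like green eggs and ham -",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size = number of types = 12

def p_add1(w, prev):
    """Add-1 smoothed bigram probability P(w | prev).

    Counter returns 0 for unseen bigrams and unseen words,
    so out-of-vocabulary words like "mark" get (0+1)/(count+V).
    """
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def sentence_prob(sentence):
    tokens = sentence.split()
    return prod(p_add1(w, prev) for prev, w in zip(tokens, tokens[1:]))

print(V)                                    # 12
print(sentence_prob("+ i am sam green -"))  # ≈ 3.14e-05
```

With the correct V = 12, the first example works out to (3/15)·(3/15)·(2/14)·(1/14)·(1/13) ≈ 3.14e-05 rather than the 4.5e-05 obtained with V = 11.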
