I am working through an example of Add-1 smoothing in the context of NLP
Say that there is the following corpus (start and end tokens included):
+ I am sam -
+ sam I am -
+ I do not like green eggs and ham -
I want to compute the probability of the following sentence under a bigram model trained on that small corpus:
+ I am sam green -
Normally, the probability would be found by:
P(i|+)*P(am|i)*P(sam|am)*P(green|sam)*P(-|green)
Which would be:
(2/3)*(2/3)*(1/2)*(0/2)*(0/1) = 0
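As a sanity check, the unsmoothed (MLE) bigram probability can be computed directly from the counts; here's a minimal Python sketch (the helper name `mle_prob` is my own):

```python
from collections import Counter

# Toy corpus from the question; "+" and "-" are the start/end tokens.
corpus = [
    "+ i am sam -",
    "+ sam i am -",
    "+ i do not like green eggs and ham -",
]

tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

def mle_prob(sentence):
    """Unsmoothed (MLE) bigram probability of a sentence."""
    words = sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(mle_prob("+ i am sam green -"))  # 0.0, since count(sam green) = 0
```

`Counter` returns 0 for unseen bigrams, so any unseen pair zeroes out the whole product, which is exactly the problem smoothing addresses.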
To try to alleviate this, I would do the following:
(Count(W[i-1]W[i])+1)/(Count(W[i-1])+V)
where V is the sum of the counts of the types in the searched sentence, as they occur in the corpus; in this instance:
V=Count(+)+Count(i)+Count(am)+Count(sam)+Count(green)=3+3+2+2+1=11
This turns out to be:
(3/14)*(3/14)*(2/13)*(1/13)*(1/12)~= 0.000045
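A quick check of that arithmetic, keeping the question's V=11 (the answer below argues for a different V):

```python
from fractions import Fraction

# Counts for "+ i am sam green -": (bigram count, history unigram count)
pairs = [(2, 3), (2, 3), (1, 2), (0, 2), (0, 1)]
V = 11  # the value computed in the question

p = Fraction(1)
for big, uni in pairs:
    p *= Fraction(big + 1, uni + V)  # add-1 smoothed bigram probability

print(float(p))  # ~0.000045
```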
Now, say I want to see the probability that the following sentence is in the small corpus:
+ I am mark johnson -
An unsmoothed probability would be undefined, since "mark" and "johnson" never appear in the corpus (0/0 terms):
(2/3)*(2/3)*(0/2)*(0/0)*(0/0)
Going straight to the smoothing portion:
V = Count(+)+Count(i)+Count(am)+Count(mark)+Count(johnson)=3+3+2+0+0=8
(3/11)*(3/11)*(1/10)*(1/8)*(1/8)~=0.00012
I fail to understand how this can be the case, considering "mark" and "johnson" are not even present in the corpus to begin with. Is this a special case that must be accounted for, or is this just a caveat of the add-1/Laplace smoothing method?
Do I just have the wrong value for V? That is, should I add 1 for each non-present word, which would make V=10 to account for "mark" and "johnson"? If so (it almost makes sense to me that this would be the case), would it be the following:
(3/13)*(3/13)*(1/12)*(1/10)*(1/10)~=0.000044
Moreover, what would be done with, say, a sentence like:
+ yo soy mark johnson -
Would it be the following (assuming that I just add each new word to the corpus once):
V=3+1+1+1+1=7
(0/3)*(1/7)*(1/7)*(1/7)*(1/7)~=0.00042
Best Answer
I know this question is old, but I'm answering for other people who may have the same question. You had the wrong value for V. V is the vocabulary size, i.e. the number of unique words (types) in your corpus. Here the types are +, i, am, sam, do, not, like, green, eggs, and, ham, and -, so V=12.
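With V taken as the vocabulary size, a minimal sketch of the add-1 smoothed computation (the helper name `add1_prob` is my own) looks like this:

```python
from collections import Counter

# Corpus from the question; "+" and "-" are the start/end tokens.
corpus = [
    "+ i am sam -",
    "+ sam i am -",
    "+ i do not like green eggs and ham -",
]

tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size = number of types, including + and -

def add1_prob(sentence):
    """Add-1 (Laplace) smoothed bigram probability of a sentence."""
    words = sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
    return p

print(V)                                   # 12
print(add1_prob("+ i am sam green -"))     # ~3.14e-05
print(add1_prob("+ i am mark johnson -"))  # ~1.98e-05, no longer undefined
```

Note that for an unseen history word like "mark", `unigrams[prev]` is 0, so the denominator is simply V; the 0/0 terms from the unsmoothed model become 1/V, and every sentence gets a small but nonzero probability.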