Language model: comparing probability scores between sentences of different lengths

language-models, natural language, normalization, perplexity, probability

My question is: how can I compare language model (LM) scores for two sentences of different lengths?

Probabilities are < 1, and since the LM score for a sentence is a product of bigram or trigram probabilities (depending on whether it's a bigram or trigram model), the scores of longer sentences will mostly be smaller.

So, how should I normalize the scores according to length?

I am pretty sure almost everyone who has read about LMs has had the same doubt, but I couldn't find much on the internet.

I would appreciate any leads on this.

Best Answer

As you noticed, it's a good idea to use some kind of averaging. Since probabilities get multiplied in an LM, the geometric mean seems like a good fit.
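To make the geometric mean concrete, here is one way to write it, assuming the sentence probability factors into per-word conditional probabilities $P(w_i \mid w_1, ..., w_{i-1})$ (for a bigram or trigram model the conditioning context is truncated to the previous one or two words). The log form on the right is what one would typically compute in practice to avoid underflow:

$\left(\prod_{i=1}^{N} P(w_i \mid w_1, ..., w_{i-1})\right)^{1/N} = \exp\left(\dfrac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, ..., w_{i-1})\right)$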

From Speech and Language Processing (Jurafsky and Martin):

In practice we don’t use raw probability as our metric for evaluating language models, but a variant called perplexity. The perplexity (sometimes called PP for short) of a language model on a test set is the inverse probability of the test set, normalized by the number of words.

$PP((w_1, ...,w_N)) = \sqrt[N]{\dfrac{1}{P(w_1, ...,w_N)}}$
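As a minimal sketch of how this normalization behaves, the snippet below computes the raw log-probability and the perplexity of two sentences from lists of per-word probabilities. The function names and the probability values are made up for illustration; in a real setting the per-word probabilities would come from your bigram or trigram model.

```python
import math

def sentence_log_prob(word_probs):
    """Log of the sentence probability: sum of per-word log-probabilities."""
    return sum(math.log(p) for p in word_probs)

def perplexity(word_probs):
    """Inverse sentence probability, normalized by the number of words N.

    Computed in log space: PP = exp(-(1/N) * sum(log p_i)).
    """
    n = len(word_probs)
    return math.exp(-sentence_log_prob(word_probs) / n)

# Hypothetical per-word probabilities (assumed values, not from a real model).
short_sentence = [0.2, 0.1, 0.3]
long_sentence = [0.2, 0.1, 0.3, 0.2, 0.1, 0.3]

# The raw score of the longer sentence is much smaller...
print(sentence_log_prob(short_sentence), sentence_log_prob(long_sentence))
# ...but the length-normalized perplexities are directly comparable.
print(perplexity(short_sentence), perplexity(long_sentence))
```

Here the longer sentence simply repeats the same per-word probabilities, so its raw probability is the square of the shorter one's, while both sentences get the same perplexity. Lower perplexity means the model finds the sentence more likely per word.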
