This is indeed something often glossed over.
Some people are doing something a bit cheeky: holding out a proportion of the words in each document, and using the predictive probabilities of these held-out words given the document-topic mixtures as well as the topic-word mixtures. This is obviously not ideal, as it doesn't evaluate performance on any held-out documents.
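For concreteness, here's a minimal numpy sketch of that held-out-word evaluation. Everything is assumed to come from your own model fit: `theta` (document-topic mixtures), `phi` (topic-word mixtures) and `heldout` (the word IDs held out per document) are illustrative names, not any particular library's API.

```python
import numpy as np

# Assumed inputs (illustrative names, from your own fitted model):
#   theta:   (n_docs, n_topics) document-topic mixtures, rows sum to 1
#   phi:     (n_topics, n_vocab) topic-word mixtures, rows sum to 1
#   heldout: list of arrays; heldout[d] holds the word IDs held out from doc d

def heldout_word_perplexity(theta, phi, heldout):
    """Perplexity of held-out words under p(w | d) = sum_k theta[d, k] * phi[k, w]."""
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(heldout):
        word_probs = theta[d] @ phi[:, words]  # p(w | d) for each held-out word
        log_lik += np.log(word_probs).sum()
        n_words += len(words)
    return np.exp(-log_lik / n_words)  # lower is better
```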
To do it properly with held-out documents, as suggested, you do need to "integrate over the Dirichlet prior for all possible topic mixtures". http://people.cs.umass.edu/~wallach/talks/evaluation.pdf reviews a few methods for tackling this slightly unpleasant integral. I'm just about to try implementing this myself, in fact, so good luck!
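As a starting point, the crudest of the estimators reviewed there is plain Monte Carlo integration: sample topic mixtures from the Dirichlet prior and average the resulting document likelihoods. A sketch under those assumptions (`phi` and `alpha` come from your trained model; this is not Wallach's left-to-right algorithm, and the estimate is high-variance for long documents):

```python
import numpy as np
from scipy.special import logsumexp

def log_heldout_doc_likelihood(words, phi, alpha, n_samples=1000, seed=0):
    """Estimate log p(words | phi, alpha) for one held-out document by
    sampling theta ~ Dirichlet(alpha) from the prior and averaging."""
    rng = np.random.default_rng(seed)
    thetas = rng.dirichlet(alpha, size=n_samples)    # (n_samples, n_topics)
    log_word_probs = np.log(thetas @ phi[:, words])  # (n_samples, n_words)
    log_doc_probs = log_word_probs.sum(axis=1)       # log p(words | theta_s, phi)
    return logsumexp(log_doc_probs) - np.log(n_samples)
```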
In general, building an ARMA-GARCH model in a stepwise fashion based on diagnostics such as the ACF, PACF and the Ljung-Box test is problematic, because the latter do not have their standard null distributions when applied to returns whose conditional variance is nonconstant, or to squared returns whose conditional mean is nonconstant. Thus the following will not work exactly as you expect it to (but hopefully the distortion will not be too large and you could still trust the results to some extent):
I'm fitting an ARIMA-GARCH model to my hedge fund index daily log return series. I used the ACF, PACF, the Ljung-Box test and Archtest to check for autocorrelation and conditional heteroskedasticity. The ACF and PACF of the returns themselves don't show significant autocorrelation (though they do for the squared returns), which the Ljung-Box test with h=0 also suggests, so I exclude autocorrelation from the mean process. To double-check, I fit an ARIMA(1,0,1), and the coefficients of both the AR and MA terms are statistically insignificant. So I exclude them and go for a GARCH-only model.
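(For readers following along, a rough Python equivalent of this diagnostic workflow might look like the sketch below; `r` is assumed to be the vector of daily log returns and the lag choices are purely illustrative. Per the caveat above, the Ljung-Box p-values on raw and squared returns do not have their standard null distributions here, so read them loosely.)

```python
from statsmodels.stats.diagnostic import acorr_ljungbox, het_arch
from arch import arch_model

# r: 1-D array of daily log returns (assumed given)

# Ljung-Box on returns and squared returns (lag choice is illustrative)
print(acorr_ljungbox(r, lags=[10]))     # autocorrelation in the mean
print(acorr_ljungbox(r**2, lags=[10]))  # autocorrelation in squared returns

# Engle's ARCH-LM test for conditional heteroskedasticity
lm_stat, lm_pval, _, _ = het_arch(r, nlags=10)
print(f"ARCH-LM p-value: {lm_pval:.4f}")

# With no significant mean dynamics, fit a zero-mean GARCH(1,1)
res = arch_model(r, mean="Zero", vol="GARCH", p=1, q=1).fit(disp="off")
print(res.summary())
```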
Going forward, I did check other similar questions about GARCH lag selection (for example, here), and it seems that when the goal is prediction, it's better to choose the model with the lowest AIC rather than the lowest BIC. So I first compare the AICs and then further check using a likelihood ratio test.
AIC, BIC and the LR test all address different questions and serve different goals. You should not expect all of them to point in the same direction, and you should choose the appropriate one based on your modelling goal. If the goal is forecasting, AIC is the most relevant choice.
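If it helps, here is a hedged sketch of that comparison using Python's arch package (`r` is again the daily log return series; the zero-mean specification follows the discussion above). Note the boundary caveat in the comments:

```python
from scipy.stats import chi2
from arch import arch_model

# r: daily log returns (assumed given)
small = arch_model(r, mean="Zero", vol="GARCH", p=1, q=1).fit(disp="off")
big = arch_model(r, mean="Zero", vol="GARCH", p=2, q=2).fit(disp="off")

print(f"GARCH(1,1) AIC: {small.aic:.2f}   GARCH(2,2) AIC: {big.aic:.2f}")

# LR test: GARCH(1,1) is nested in GARCH(2,2). Caveat: the restricted
# parameters sit on the boundary of the parameter space (alpha_2 = beta_2 = 0),
# so the chi-squared null is only an approximation here.
lr = 2 * (big.loglikelihood - small.loglikelihood)
df = big.num_params - small.num_params
print(f"LR = {lr:.3f}, p-value = {chi2.sf(lr, df):.4f}")
```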
Regarding your Q1, experience in finance tells us that high-order GARCH models do not tend to beat low-order GARCH models. I would stick to a relatively parsimonious model unless I had reason to believe the time series is somehow special and unlike other financial time series. I do not see a sound theoretical reason to select a model that has a higher AIC than another (when there are not that many models being compared, as in your case), but experience in finance points to a different solution.
Regarding your Q2, see above.
Regarding your Q3, it does matter that you consider the full model. Considering only part of the model does not make sense. (You could construct examples where you choose really poor models over much better models only because you happen to look at part of the picture instead of the whole picture.)
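To make the "full model" point concrete: the AIC you compare should come from the joint likelihood of the mean and variance equations together, not from either piece in isolation. A minimal sketch, assuming the arch package and your return series `r` (the AR(1) mean is just an illustrative alternative, since arch does not support MA terms):

```python
from arch import arch_model

# r: daily log returns (assumed given)
candidates = {
    "zero mean + GARCH(1,1)": arch_model(r, mean="Zero", vol="GARCH", p=1, q=1),
    "AR(1) mean + GARCH(1,1)": arch_model(r, mean="AR", lags=1, vol="GARCH", p=1, q=1),
}
for name, am in candidates.items():
    res = am.fit(disp="off")
    # res.aic is based on the full joint likelihood (mean + variance together)
    print(f"{name}: AIC = {res.aic:.2f}")
```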
Best Answer
This has less to do with perplexity, and more to do with cross-validation and test perplexity specifically. Here's a fuller excerpt from the paper, emphasis mine:
That is, a lower perplexity indicates that the data are more likely. As referenced in your equation, the authors are calculating test-set perplexity. In other words, they're estimating how well their model generalizes by testing it on unseen data. Incidentally, this gives them a practical comparison with competing models whose parameter spaces could be vastly different.
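For reference, the standard definition of test-set perplexity (as in the original LDA paper, Blei et al., 2003, which your equation may be following) is

$$\mathrm{perplexity}(D_{\text{test}}) = \exp\left\{-\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right\},$$

i.e. the exponentiated negative average per-word log-likelihood over the $M$ test documents, where $N_d$ is the length of document $d$; lower values mean the held-out data are more probable under the model.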
It's worth noting that your intuition about higher log-likelihood (or lower perplexity) and overfitting would be right for a training set. As overfitting occurs, curves of training and test perplexity should resemble the learning-curve plots you're probably familiar with: training perplexity should continue decreasing but flatten out, while test perplexity should decrease and then increase in a parabolic sort of shape.
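If you want to see those curves for a topic model, a minimal sketch with gensim follows; `train_corpus`, `heldout_corpus` and `dictionary` are assumed to be your own bag-of-words corpora and Dictionary, and `np.exp2` matches gensim's own base-2 perplexity convention. Keep in mind that the bound-based estimate is a variational approximation, not the exact marginal likelihood discussed earlier.

```python
import numpy as np
from gensim.models import LdaModel

# train_corpus, heldout_corpus: bag-of-words corpora (assumed given)
# dictionary: gensim Dictionary mapping word IDs to tokens (assumed given)

for k in [5, 10, 20, 40, 80]:
    lda = LdaModel(corpus=train_corpus, id2word=dictionary,
                   num_topics=k, passes=10, random_state=0)
    # log_perplexity returns a per-word bound; gensim defines perplexity = 2**(-bound)
    train_ppl = np.exp2(-lda.log_perplexity(train_corpus))
    test_ppl = np.exp2(-lda.log_perplexity(heldout_corpus))
    print(f"k={k:3d}  train perplexity={train_ppl:9.1f}  test perplexity={test_ppl:9.1f}")
```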