Solved – How should perplexity of LDA behave as value of the latent variable k increases

latent-dirichlet-alloc, latent-variable, perplexity

When increasing the value of the latent variable k for LDA (latent Dirichlet allocation), how should perplexity behave:

  1. On the training set?
  2. On the testing set?

Best Answer

The original paper on LDA gives some insights into this:

In particular, we computed the perplexity of a held-out test set to evaluate the models. The perplexity, used by convention in language modeling, is monotonically decreasing in the likelihood of the test data, and is algebraically equivalent to the inverse of the geometric mean per-word likelihood. A lower perplexity score indicates better generalization performance.
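For reference, the perplexity quoted above is defined in the paper, for a test set of \(M\) documents where \(N_d\) is the number of words in document \(d\), as:

$$
\text{perplexity}(D_{\text{test}}) = \exp\left\{-\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right\}
$$

Since the exponent is the negative average per-word log-likelihood, a higher likelihood on held-out data gives a lower perplexity.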

This describes the behavior on test data: lower held-out perplexity means better generalization, so you would expect test perplexity to fall as k increases until the model starts to overfit. The paper includes a plot of this:

[Figure from the LDA paper: perplexity as a function of k]
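A minimal sketch of how you might reproduce such a curve yourself, assuming scikit-learn's `LatentDirichletAllocation` (the toy corpus and the choice of k values are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy document-term count matrix: 40 documents over a 25-word vocabulary.
X = rng.integers(0, 5, size=(40, 25))
X_train, X_test = X[:30], X[30:]

for k in (2, 5, 10):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    # Held-out perplexity: lower indicates better generalization.
    print(k, lda.perplexity(X_test))
```

On a real corpus you would plot these values against k; training-set perplexity tends to keep falling as k grows, while held-out perplexity eventually flattens or rises once the extra topics start fitting noise.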