Solved – When is it OK to *not* use a held-out set for topic model evaluation?

topic-models

Typically, I see people calculate the perplexity/likelihood of a topic model against a held-out set of documents, but is this always necessary/appropriate?

I'm thinking in particular of a use case in which we're modeling a full corpus, with no plans to use the model for prediction on new documents. In other words, the corpus is complete and self-contained, and the topic model is only for exploration and dimensionality reduction on the original corpus.

At least intuitively, this seems to be a case where overfitting isn't really possible/meaningful, so isn't it best to calculate the model likelihood/perplexity against the full training set of documents and pick the model that scores best there (i.e., lowest perplexity / highest likelihood)?
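
For concreteness, here is a minimal sketch of what I have in mind, using gensim; the toy documents and model settings below are placeholders for the real corpus:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder corpus: in practice, `texts` is the full, tokenized collection.
texts = [
    ["topic", "model", "corpus", "exploration"],
    ["dimensionality", "reduction", "corpus", "documents"],
    ["topic", "coherence", "perplexity", "likelihood"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# Fit on the complete corpus -- there is no separate held-out split.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# Score the model on the same documents it was trained on:
# log_perplexity returns a per-word likelihood bound (higher is better);
# the corresponding perplexity is lower for better-fitting models.
print(lda.log_perplexity(corpus))
```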

Best Answer

"but is this always necessary/appropriate?"

No, this is not always necessary. Many papers, e.g. "Improving Topic Models with Latent Feature Word Representations", compare topic models using topic coherence, document clustering, document classification, or information retrieval performance rather than (or in addition to) perplexity on held-out data.