I've been experimenting with LDA topic modelling using Gensim. I couldn't find any topic model evaluation facility in Gensim that reports the perplexity of a topic model on held-out evaluation texts and thus facilitates subsequent fine-tuning of LDA parameters (e.g. the number of topics). It would be greatly appreciated if anyone could shed some light on how I can perform topic model evaluation in Gensim. This question has also been posted on Stack Overflow.
Solved – Topic models evaluation in Gensim
natural language, python, topic-models
Related Solutions
This is a late answer, but it can be useful for other people searching for related research and tools for this problem:
Weiwei Guo from Columbia implemented code for short-text topic modeling. He described the implementation in the paper "Modeling Sentences in the Latent Space" (http://aclweb.org/anthology-new/P/P12/P12-1091v2.pdf) and the code is available here: http://www.cs.columbia.edu/~weiwei/code.html
Although this is not topic modeling, if you have a classification task involving short pieces of text, you can use LibShortText. From the description on their web site:
"LibShortText is an open source tool for short-text classification and analysis. It can handle the classification of, for example, titles, questions, sentences, and short messages..."
A large body of literature on extracting information from written text has emerged recently. I will therefore describe only four milestone/popular models with their advantages and disadvantages, and thus highlight (some of) the main differences (or at least what I think are the most important ones).
You mention the "easiest" approach, which would be to cluster the documents by matching them against a predefined query of terms (as in PMI). These lexical matching methods, however, can be inaccurate due to polysemy (one term having multiple meanings) and synonymy (multiple terms having similar meanings).
As a remedy, latent semantic indexing (LSI) maps terms and documents into a latent semantic space via a singular value decomposition. The LSI results are more robust indicators of meaning than individual terms would be. However, one drawback of LSI is that it lacks a solid probabilistic foundation.
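As a rough sketch of the decomposition behind LSI (the notation is my own assumption, not taken from the answer): let $X$ be the $m \times n$ term-document matrix for $m$ terms and $n$ documents, and let $k$ be a user-chosen number of latent dimensions. LSI keeps only the $k$ largest singular values:

```latex
% X: term-document matrix; U_k, V_k: first k left/right singular vectors;
% \Sigma_k: diagonal matrix of the k largest singular values.
X = U \Sigma V^{\top}, \qquad
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll \min(m, n)
```

Documents are then compared in the $k$-dimensional space spanned by the columns of $\Sigma_k V_k^{\top}$ rather than in the original term space.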
This was partly solved by the invention of probabilistic LSI (pLSI). In pLSI, each word in a document is drawn from a mixture model specified via multinomial random variables (which also allows for higher-order co-occurrences, as @sviatoslav hong mentioned). This was an important step forward in probabilistic text modeling, but it was incomplete in the sense that it offers no probabilistic structure at the level of documents.
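For concreteness, the pLSI mixture is usually written as follows (symbols are assumptions for illustration: $d$ is a document index, $w$ a word, $z$ one of $K$ latent topics):

```latex
% Each word in document d is generated by first picking a topic z from p(z | d)
% and then picking the word from that topic's multinomial p(w | z).
p(d, w) = p(d) \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)
```

Here $p(z \mid d)$ is a separate parameter vector for each training document, which is the sense in which the model has no generative structure at the document level.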
Latent Dirichlet Allocation (LDA) alleviates this and was the first fully probabilistic model for text clustering. Blei et al. (2003) show that pLSI is a maximum a-posteriori estimated LDA model under a uniform Dirichlet prior.
Note that the models mentioned above (LSI, pLSI, LDA) all rest on the “bag-of-words” assumption, i.e. that words within a document are exchangeable, so their order can be neglected. This assumption of exchangeability offers a further justification for LDA over the other approaches. Suppose that not only the words within documents but also the documents themselves are exchangeable, i.e. the order of documents within a corpus can be neglected. De Finetti's theorem states that any infinitely exchangeable collection of random variables has a representation as a mixture distribution. Thus, if exchangeability is assumed for both documents and words within documents, a mixture model for both is needed. This is exactly what LDA achieves, whereas PMI and LSI do not (and pLSI does so less elegantly than LDA).
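To make the mixture representation concrete, the marginal probability of a document under LDA can be sketched as below, following the notation of Blei et al. (2003): $\theta$ is the per-document topic mixture, $\alpha$ the Dirichlet parameter, $\beta$ the topic-word probabilities, and $\mathbf{w} = (w_1, \dots, w_N)$ the words of one document.

```latex
% Words are conditionally i.i.d. given the document-level mixture \theta,
% which is itself drawn from a Dirichlet prior: a mixture over mixtures.
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\; d\theta
```

The integral over $\theta$ is the document-level mixture that pLSI lacks, and the product of sums inside it is the word-level mixture.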
Best Answer
Found the answer on the gensim mailing list.
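In short, the bound() method of LdaModel computes a variational lower bound on the log-likelihood of a corpus; evaluating it on a held-out corpus yields a perplexity estimate. The related log_perplexity() method returns the same bound normalised per word.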
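A minimal sketch of how this might be used to compare numbers of topics. The helper names `heldout_perplexity` and `compare_topic_counts` and the toy texts are hypothetical, introduced only for illustration. The perplexity follows the convention Gensim itself logs, 2^(-per-word bound), and the model only needs to expose a `bound()` method, as `LdaModel` does.

```python
def heldout_perplexity(lda, heldout_bow):
    """Return (per-word bound, perplexity) for a held-out bag-of-words corpus.

    `lda` is any object with a `bound(corpus)` method, e.g. gensim's LdaModel.
    `heldout_bow` is an iterable of documents, each a list of (token_id, count).
    """
    heldout_bow = list(heldout_bow)
    n_tokens = sum(count for doc in heldout_bow for _, count in doc)
    if n_tokens == 0:
        raise ValueError("held-out corpus contains no tokens")
    # bound() returns the variational bound for the whole corpus;
    # dividing by the token count gives the per-word bound.
    per_word_bound = lda.bound(heldout_bow) / n_tokens
    # Same convention as the value gensim logs in log_perplexity().
    return per_word_bound, 2.0 ** (-per_word_bound)


def compare_topic_counts(train_texts, heldout_texts, topic_counts=(2, 3, 5)):
    """Hypothetical driver: train one model per topic count and score each."""
    # Imported lazily so the helper above works without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(train_texts)
    train_bow = [dictionary.doc2bow(text) for text in train_texts]
    # Held-out tokens missing from the training dictionary are dropped.
    heldout_bow = [dictionary.doc2bow(text) for text in heldout_texts]

    results = {}
    for k in topic_counts:
        lda = LdaModel(train_bow, id2word=dictionary, num_topics=k,
                       passes=10, random_state=0)
        results[k] = heldout_perplexity(lda, heldout_bow)
    return results
```

Calling `compare_topic_counts` with tokenised training and held-out texts (lists of lists of strings) returns a dict mapping each topic count to its per-word bound and perplexity; lower perplexity on the held-out texts suggests a better fit. Calling `lda.log_perplexity(heldout_bow)` directly should give the same per-word bound.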