Solved – Evaluation of LDA

topic-models

I am looking for a C++ or Java implementation that computes the perplexity of held-out documents under Latent Dirichlet Allocation (LDA). Can anybody suggest useful links?

Best Answer

After some searching, I found the UMass MALLET library (Java).

You can use its functions to calculate the log probability of each document, $\log p(\mathbf{w}_d)$, in your held-out set, and from those values you can calculate perplexity with the formula from the original LDA paper (Blei, Ng, and Jordan, 2003):

$$ \operatorname{perplexity}(D_{\text{held-out}}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right) $$

where $M$ is the number of held-out documents, $N_d$ is the number of words (tokens) in document $d$, and $\mathbf{w}_d$ is the sequence of words in document $d$.
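Once you have the per-document log probabilities, the last step is just arithmetic. Here is a minimal Java sketch of that step only. The class `HeldOutPerplexity` and its `perplexity` method are hypothetical names of my own, not part of MALLET, and the sketch assumes the log probabilities are natural logs (if your tool reports base-2 logs, convert them first) and that you obtained them elsewhere.

```java
// Hypothetical helper, not part of MALLET: it only applies the perplexity
// formula to per-document log probabilities computed elsewhere.
final class HeldOutPerplexity {

    /**
     * @param docLogProbs natural-log probability log p(w_d) of each held-out document
     * @param docLengths  token count N_d of each held-out document
     * @return exp(-sum_d log p(w_d) / sum_d N_d)
     */
    static double perplexity(double[] docLogProbs, int[] docLengths) {
        if (docLogProbs.length != docLengths.length) {
            throw new IllegalArgumentException("need one length per document");
        }
        double totalLogProb = 0.0;
        long totalTokens = 0;
        for (int d = 0; d < docLogProbs.length; d++) {
            totalLogProb += docLogProbs[d];
            totalTokens += docLengths[d];
        }
        if (totalTokens == 0) {
            throw new IllegalArgumentException("held-out set contains no tokens");
        }
        return Math.exp(-totalLogProb / totalTokens);
    }

    public static void main(String[] args) {
        // Toy values: two documents (2 and 3 tokens) in which every token has
        // probability 1/4, so the perplexity should come out at about 4.
        double[] logProbs = {2 * Math.log(0.25), 3 * Math.log(0.25)};
        int[] lengths = {2, 3};
        System.out.println(perplexity(logProbs, lengths));
    }
}
```

Note that the sums run over the whole held-out set before dividing, so this is a per-token figure and not an average of per-document perplexities.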

I found a forum post that describes some of the steps for doing this: http://t3527.ai-mallet-development.aitalk.info/model-perplexity-t3527.html

Hope that helps!