Latent Dirichlet Allocation – How to Calculate Perplexity of a Holdout?

text-mining, topic-models

I'm confused about how to calculate the perplexity of a holdout sample when doing Latent Dirichlet Allocation (LDA). The papers on the topic breeze over it, making me think I'm missing something obvious…

Perplexity is seen as a good measure of performance for LDA. The idea is that you keep a holdout sample, train your LDA on the rest of the data, then calculate the perplexity of the holdout.

The perplexity could be given by the formula:

$\text{per}(D_{\text{test}}) = \exp\left\{-\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M}N_d}\right\}$

(Taken from Image retrieval on large-scale image databases, Horster et al.)

Here $M$ is the number of documents (in the test sample, presumably), $\mathbf{w}_d$ represents the words in document $d$, and $N_d$ is the number of words in document $d$.
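Just so we're all talking about the same thing: the outer computation is mechanical once $\log p(\mathbf{w}_d)$ is known for each test document. A minimal sketch (function and variable names are mine):

```python
import numpy as np

def perplexity(log_p_w, doc_lengths):
    """per(D_test) = exp{ - sum_d log p(w_d) / sum_d N_d }."""
    log_p_w = np.asarray(log_p_w, dtype=float)          # log p(w_d) per test doc
    doc_lengths = np.asarray(doc_lengths, dtype=float)  # N_d per test doc
    return np.exp(-log_p_w.sum() / doc_lengths.sum())
```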

It is not clear to me how to sensibly calculate $p(\mathbf{w}_d)$, since we don't have topic mixtures for the held-out documents. Ideally, we would integrate over the Dirichlet prior for all possible topic mixtures and use the topic multinomials we learned. Calculating this integral doesn't seem to be an easy task, however.
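For what it's worth, the integral can at least be approximated by plain Monte Carlo over the prior: draw $\theta \sim \text{Dir}(\alpha)$, evaluate the likelihood of the document under that mixture, and average. It's a high-variance sketch rather than a practical estimator, assuming a learned $K \times V$ topic-word matrix `phi` and prior `alpha` (names are mine):

```python
import numpy as np
from scipy.special import logsumexp

def log_p_doc_mc(doc_word_ids, phi, alpha, n_samples=1000, rng=None):
    """Naive Monte Carlo estimate of log p(w_d):
    p(w_d) = E_{theta ~ Dir(alpha)}[ prod_n sum_k theta_k * phi[k, w_n] ].
    phi: (K, V) topic-word probabilities; alpha: (K,) Dirichlet prior."""
    rng = np.random.default_rng(rng)
    thetas = rng.dirichlet(alpha, size=n_samples)   # (S, K) draws from prior
    token_probs = thetas @ phi[:, doc_word_ids]     # (S, N_d) per-token probs
    log_liks = np.log(token_probs).sum(axis=1)      # (S,) doc log-lik per draw
    # log of the Monte Carlo average, computed stably
    return logsumexp(log_liks) - np.log(n_samples)
```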

Alternatively, we could attempt to learn an optimal topic mixture for each held-out document (given our learned topics) and use this to calculate the perplexity. This would be doable; however, it's not as trivial as papers such as Horster et al. and Blei et al. seem to suggest, and it's not immediately clear to me that the result will be equivalent to the ideal case above.
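For concreteness, this is the kind of fold-in I have in mind: a simple EM that fits $\theta_d$ with the topics held fixed, then scores the document under the point estimate. The $\alpha$ smoothing in the M-step is my pragmatic choice, not necessarily what Blei et al. do:

```python
import numpy as np

def fold_in_theta(doc_word_ids, phi, alpha, n_iters=100):
    """Point estimate of the topic mixture theta for one held-out
    document, with the learned topics phi (K, V) held fixed."""
    K = phi.shape[0]
    theta = np.full(K, 1.0 / K)
    lik = phi[:, doc_word_ids]                  # (K, N_d) per-token likelihoods
    for _ in range(n_iters):
        resp = theta[:, None] * lik             # E-step: topic responsibilities
        resp /= resp.sum(axis=0, keepdims=True)
        theta = resp.sum(axis=1) + alpha        # M-step, smoothed by the prior
        theta /= theta.sum()
    return theta

def log_p_doc_fold_in(doc_word_ids, phi, alpha):
    """log p(w_d | theta_hat) under the fitted point estimate."""
    theta = fold_in_theta(doc_word_ids, phi, alpha)
    return np.log(theta @ phi[:, doc_word_ids]).sum()
```

Note the catch: $\theta$ is fitted on exactly the tokens it is then asked to predict, so this overstates $p(\mathbf{w}_d)$ relative to the true integral, which is part of why I doubt the two are equivalent. (For reference, I believe scikit-learn's LatentDirichletAllocation.perplexity does a variational version of this fold-in.)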

Best Answer

This is indeed something often glossed over.

Some people are doing something a bit cheeky: holding out a proportion of the words in each document, and using the predictive probabilities of these held-out words given the document-topic mixtures as well as the topic-word mixtures. This is obviously not ideal, as it doesn't evaluate performance on any fully held-out documents.
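In case it's unclear, the "cheeky" scheme looks something like this for a single document: fit the mixture on a random half of the tokens (here with the same simple fixed-topics EM as in the question), then score the other half. Again, `phi` and `alpha` are the learned topics and prior; everything else is my naming:

```python
import numpy as np

def completion_log_lik(doc_word_ids, phi, alpha, n_iters=100, rng=None):
    """Document-completion evaluation: fit theta on a random half of the
    tokens (topics phi held fixed), then score the held-out half."""
    rng = np.random.default_rng(rng)
    ids = np.asarray(doc_word_ids)
    held = rng.random(len(ids)) < 0.5           # which tokens to hold out
    observed, held_out = ids[~held], ids[held]

    K = phi.shape[0]
    theta = np.full(K, 1.0 / K)
    lik = phi[:, observed]                      # (K, N_obs)
    for _ in range(n_iters):
        resp = theta[:, None] * lik             # E-step: topic responsibilities
        resp /= resp.sum(axis=0, keepdims=True)
        theta = resp.sum(axis=1) + alpha        # M-step, smoothed by the prior
        theta /= theta.sum()

    # predictive log-probability of the held-out half, and its token count
    return np.log(theta @ phi[:, held_out]).sum(), len(held_out)
```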

To do it properly with held-out documents, as suggested, you do need to "integrate over the Dirichlet prior for all possible topic mixtures". http://people.cs.umass.edu/~wallach/talks/evaluation.pdf reviews a few methods for tackling this slightly unpleasant integral. I'm just about to try and implement this myself in fact, so good luck!
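In case it helps anyone else, here's my reading of the left-to-right estimator from those slides, in a simplified form that skips the resampling of earlier topic assignments (so it's cheaper but more biased than the full algorithm):

```python
import numpy as np

def left_to_right_log_lik(doc_word_ids, phi, alpha, n_particles=20, rng=None):
    """Simplified left-to-right estimate of log p(w_d) with fixed topics
    phi (K, V) and Dirichlet prior alpha (K,). Each particle carries one
    sampled topic assignment history for the tokens seen so far."""
    rng = np.random.default_rng(rng)
    K = phi.shape[0]
    alpha = np.asarray(alpha, dtype=float)
    counts = np.zeros((n_particles, K))         # per-particle topic counts
    total_log_lik = 0.0
    for n, w in enumerate(doc_word_ids):
        # p(z_n = k | z_<n) per particle: Dirichlet-multinomial predictive
        prior = (counts + alpha) / (n + alpha.sum())      # (R, K)
        # p(w_n | z_<n) per particle, averaged over particles
        p_w = prior @ phi[:, w]                           # (R,)
        total_log_lik += np.log(p_w.mean())
        # advance each particle: sample z_n | w_n, z_<n
        post = prior * phi[:, w]                          # (R, K)
        post /= post.sum(axis=1, keepdims=True)
        for r in range(n_particles):
            z = rng.choice(K, p=post[r])
            counts[r, z] += 1
    return total_log_lik
```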