Solved – LDA number of topics – determining how well a given number of topics fits

latent-dirichlet-alloc, scikit-learn, topic-models

I have a more theoretical question about LDA (Latent Dirichlet Allocation).

When doing LDA we provide the number of topics ourselves. As far as I understand, the algorithm then builds topic-word and document-topic distributions so as to minimize perplexity (which is why the fitting is done in an iterative manner).

So the question – can we fit LDA several times with different numbers of topics, check the perplexity of each result, and choose the number of topics that yields the minimal perplexity? Or am I misunderstanding the meaning of perplexity and the algorithm itself, and perplexity is actually not a 'fit measure' for LDA?
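For reference, the standard definition of held-out perplexity for topic models (from Blei et al., 2003) is

$$
\text{perplexity}(D_{\text{test}}) = \exp\!\left( -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right),
$$

where $M$ is the number of held-out documents, $\mathbf{w}_d$ is the word sequence of document $d$, and $N_d$ is its length; lower values indicate a better fit.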

Best Answer

Yes, in fact this is the cross-validation method of choosing the number of topics. But note that you should minimize the perplexity of a held-out dataset, not of the training data, to avoid overfitting.
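A minimal sketch of that procedure with scikit-learn (per the question's tag); the corpus, the candidate topic counts, and the vectorizer settings below are illustrative assumptions, not prescriptions:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Illustrative corpus (downloaded on first use); substitute your own documents.
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

# LDA works on raw term counts, not tf-idf.
X = CountVectorizer(max_df=0.95, min_df=5, stop_words="english").fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit one model per candidate number of topics and score each on the
# held-out split; lower perplexity is better.
scores = {}
for k in (5, 10, 20, 40):  # candidate topic counts (assumed)
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X_train)
    scores[k] = lda.perplexity(X_test)

best_k = min(scores, key=scores.get)
print(scores, "->", best_k, "topics")
```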

It's worth noting that a non-parametric extension of LDA, the hierarchical Dirichlet process (HDP), can infer the number of topics from the data without cross-validation. Implementations exist on David Blei's lab's GitHub, but at the time of this writing I haven't seen HDP-LDA implemented in any mainstream, open-source ML library.
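For readers who want to experiment with HDP anyway, gensim ships gensim.models.HdpModel (which the claim above may predate); a minimal sketch, assuming the documents are already tokenized:

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Toy tokenized corpus; substitute your own preprocessed documents.
texts = [
    ["human", "computer", "interaction", "interface"],
    ["graph", "trees", "minors", "survey"],
    ["user", "interface", "response", "time"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# HDP infers the number of topics from the data (up to a truncation level).
hdp = HdpModel(corpus, id2word=dictionary)
print(hdp.print_topics(num_topics=5, num_words=4))
```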

Caveat: hierarchical LDA is different. It finds a hierarchy of topics, whereas hierarchical Dirichlet processes let you fit a potentially infinite number of flat topics. (The dual use of "hierarchy" can be confusing.)