It is said that the key inferential problem that needs to be solved to use LDA (latent Dirichlet allocation) is that of computing the posterior distribution $p(\theta,z | w, \alpha ,\beta)$. I know LDA inference was first presented using variational inference on a simplification of LDA's graphical model, but other methods such as Gibbs sampling can also be used to estimate $p(\theta,z | w, \alpha ,\beta)$.
Once $p(\theta,z | w, \alpha ,\beta)$ has been calculated, how is it used? How can we do document classification with $p(\theta,z | w, \alpha ,\beta)$?
(Notation: the same in the original LDA paper and on Wikipedia.)
Best Answer
How to classify documents
Importantly, latent Dirichlet allocation is an unsupervised method: on its own, it doesn't account for the class or category of a document. But, as discussed in section 7.2 of the paper that introduced it, it can be used to develop features for classification.
So as a general, practical answer to your second question: parameters of the topic distribution for a document can be used as features in a classifier of your choice. That's exactly what the authors of LDA did in their experiments.
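To make "parameters of the topic distribution" concrete, here is a short sketch of two standard point estimates. It is an illustration under stated assumptions, not a quotation of the paper's procedure. The symbols $\gamma_d$ (the variational Dirichlet parameters for document $d$), $n_{d,k}$ (the number of words in document $d$ assigned to topic $k$ in one Gibbs sample of $z$), $N_d$ (the document length) and $K$ (the number of topics) are introduced here for illustration, and the second formula assumes a symmetric prior with scalar $\alpha$.

With variational inference, the per-document posterior over $\theta$ is approximated by a Dirichlet, whose mean is

$$\hat{\theta}_{d,k} = \frac{\gamma_{d,k}}{\sum_{j=1}^{K} \gamma_{d,j}}$$

With Gibbs sampling, a comparable estimate from the topic assignments is

$$\hat{\theta}_{d,k} = \frac{n_{d,k} + \alpha}{N_d + K\alpha}$$

Either way, each document ends up as a vector of $K$ non-negative numbers that sum to one, which can be passed to a classifier as features.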
Here's an example of what this could look like in Python. It transforms the digits dataset from sklearn to a 16-topic space, then predicts the digit using logistic regression. (Sixteen was chosen rather arbitrarily after some exploration in my answer here.) It gives reasonable results, judging by the confusion matrix it prints.
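The code for this example is not included in the text, so what follows is only a minimal sketch of the pipeline as described: digits data, a 16-topic LDA transform, then logistic regression. The train/test split, the `random_state` values and `max_iter` are assumptions made for illustration, so the numbers it prints need not match the original answer's.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Pixel intensities (0..16) are non-negative, so they can stand in for word counts.
X, y = load_digits(return_X_y=True)

# Assumed split: 25% held out, stratified so every digit appears in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# LDA maps each image to 16 topic weights; the classifier only sees those weights.
model = make_pipeline(
    LatentDirichletAllocation(n_components=16, random_state=0),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: true digit, columns: predicted digit
print(cm)
```

Putting LDA inside the pipeline means the topics are learned from the training split only, so the held-out images do not leak into the feature construction.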
For a text application, see this classification example from the sklearn docs.
Uses for the posterior distributions
As for your first question, LDA topics are also useful outside of classification: the extracted topics can give a descriptive summary of a corpus. Another sklearn example does this on the 20 newsgroups dataset and prints the top words of each topic. In its output you can already see some intuitive overlap with the newsgroup names, described here, e.g. talk.politics.guns and talk.religion.misc. You can carry this descriptive analysis further, but exactly how depends on what you are interested in.