Machine Learning – Why LDA (Latent Dirichlet Allocation) Works by Grouping Co-Occurring Words

latent-variable, machine-learning, statistical-significance

I am studying LDA, but have very weak statistical knowledge. I have a question about Gibbs sampling, one of the methods for inferring the distribution of topics and of word–topic assignments given a document. The sampler basically iterates over the words and computes the probability of assigning each word to each topic after removing that specific word from the counts. My question is: why does this work?

I found an explanation below, but I am not able to understand the parts in bold …

Word probabilities are maximized by dividing the words among the topics. (More terms means more mass to be spread around.) In a mixture, this is enough to find clusters of co-occurring words. In LDA, the Dirichlet on the topic proportions can encourage sparsity, i.e., a document is penalized for using many topics. Loosely, this can be thought of as softening the strict definition of "co-occurrence" in a mixture model. This flexibility leads to sets of terms that more tightly co-occur.

Best Answer

Technically, LDA Gibbs sampling works because we deliberately construct a Markov chain that converges to the posterior distribution over the model parameters, or over the word–topic assignments. See http://en.wikipedia.org/wiki/Gibbs_sampling#Mathematical_background.
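
For reference, here is the conditional that a collapsed Gibbs sampler for LDA draws from at each step; the notation ($\alpha$, $\beta$, the count symbols) follows the usual convention (as in Griffiths & Steyvers, 2004) rather than anything in the original post:

$$
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\underbrace{\frac{n_{k,-i}^{(w_i)} + \beta}{n_{k,-i}^{(\cdot)} + V\beta}}_{\text{word factor}}
\;\times\;
\underbrace{\bigl(n_{d_i,-i}^{(k)} + \alpha\bigr)}_{\text{document factor}},
$$

where $n_{k,-i}^{(w_i)}$ is the number of times word $w_i$ is currently assigned to topic $k$, $n_{k,-i}^{(\cdot)}$ the total number of tokens assigned to topic $k$, $n_{d_i,-i}^{(k)}$ the number of tokens in document $d_i$ assigned to topic $k$ (all counts excluding token $i$ itself), $V$ the vocabulary size, and $\alpha, \beta$ the Dirichlet hyperparameters.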

But I guess you are looking for an intuitive answer to why the sampler tends to put similar words into the same topic? That's an interesting question. If you look at the equations for collapsed Gibbs sampling (such as the conditional sketched above), there is one factor for words and another for documents. Probabilities are higher for assignments that "don't break document boundaries": words appearing in the same document have slightly higher odds of ending up in the same topic. The same holds on the word side: assignments to a degree follow "word boundaries", so occurrences of the same word in different documents tend to end up in the same topic. Over the iterations, these effects mix and spread across clusters of documents and words.
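
A minimal sketch of such a collapsed Gibbs sampler, with the two factors made explicit, might look like the following. The function name, the hyperparameter defaults, and the input format (documents as lists of integer word ids in [0, V)) are illustrative assumptions, not taken from the post:

    # Minimal collapsed Gibbs sampler sketch for LDA (illustrative, not a
    # reference implementation).
    import numpy as np

    def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
        rng = np.random.default_rng(seed)
        D = len(docs)
        n_dk = np.zeros((D, K))   # topic counts per document
        n_kw = np.zeros((K, V))   # word counts per topic
        n_k = np.zeros(K)         # total tokens per topic
        z = []                    # topic assignment for every token

        # Random initialization of topic assignments
        for d, doc in enumerate(docs):
            z_d = rng.integers(K, size=len(doc))
            z.append(z_d)
            for w, k in zip(doc, z_d):
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]
                    # Remove the current token from all counts
                    n_dk[d, k] -= 1
                    n_kw[k, w] -= 1
                    n_k[k] -= 1

                    # Word factor: how strongly each topic already "owns" this word.
                    word_factor = (n_kw[:, w] + beta) / (n_k + V * beta)
                    # Document factor: how much this document already uses each topic.
                    doc_factor = n_dk[d] + alpha

                    # Sample a new topic proportional to the product of both factors.
                    p = word_factor * doc_factor
                    k = rng.choice(K, p=p / p.sum())

                    # Add the token back under its new assignment
                    z[d][i] = k
                    n_dk[d, k] += 1
                    n_kw[k, w] += 1
                    n_k[k] += 1

        # Normalize n_dk / n_kw (plus the priors) to get topic and word distributions.
        return n_dk, n_kw

    # Toy usage: 4 documents over a 6-word vocabulary, 2 topics.
    docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5]]
    n_dk, n_kw = gibbs_lda(docs, V=6, K=2)

The two factors in the sampling step are exactly the "document boundary" and "word boundary" effects described above: the document factor rewards topics the document already uses, and the word factor rewards topics that already contain many occurrences of that word.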

By the way, LDA Gibbs samplers do not actually work properly, in the sense that they do not mix and cannot represent the full posterior distribution. If they did, the permutation symmetries of the model (topic labels can be swapped without changing the likelihood) would make all solutions obtained by samplers useless, or at least non-interpretable. Instead, the sampler sticks around a local mode of the likelihood, and we get well-defined topics.
