Since a similarity/distance measure is crucial for every clustering algorithm, I wonder what this measure is for LDA.
Since LDA works on text as a bag-of-words model, could the similarity between topics (clusters) be based on the representative words of those clusters?
For example:
- topic 1: [dog, cat, animal]
- topic 2: [dog, fetch, catch]
If those are the representative words, is the measure for clustering these topics their similarity in vector space?
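To make concrete the kind of word-overlap similarity I have in mind, here is a minimal sketch using Jaccard similarity on the example topics above (Jaccard is just one possible choice; LDA itself does not define such a measure):

```python
# Representative word sets from the example topics.
topic1 = {"dog", "cat", "animal"}
topic2 = {"dog", "fetch", "catch"}

# Jaccard similarity: shared words divided by distinct words overall.
jaccard = len(topic1 & topic2) / len(topic1 | topic2)
print(jaccard)  # 1 shared word ("dog") out of 5 distinct words -> 0.2
```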
Greetings
Best Answer
LDA does not have a distance metric
The intuition behind the LDA topic model is that words belonging to a topic appear together in documents. Unlike typical clustering algorithms such as K-Means, it does not assume any distance measure between topics. Instead, it infers topics purely from word counts, using the bag-of-words representation of documents.
This can be appreciated from the Gibbs sampler described in the paper by Griffiths et al.:
$$ P(z_i=j \mid \textbf{z}_{-i} , \textbf{w} ) \propto \frac{n^{(w_i)}_{-i,j}+\beta}{n^{(.)}_{-i,j}+W\beta} \times \frac{n^{(d_i)}_{-i,j}+\alpha}{n^{(d_i)}_{-i,.}+T\alpha} $$
$P(z_i=j \mid \textbf{z}_{-i} , \textbf{w} )$ refers to the probability of assigning topic $j$ to the $i^{th}$ word, given all other assignments. It is the product of two factors: the probability of word $w_i$ under topic $j$ (the first fraction) and the probability of topic $j$ in document $d_i$ (the second fraction).
These probabilities can be computed from simple counts: $n^{(w_i)}_{-i,j}$ is the number of times word $w_i$ is assigned to topic $j$, $n^{(.)}_{-i,j}$ is the total number of words assigned to topic $j$, $n^{(d_i)}_{-i,j}$ is the number of words in document $d_i$ assigned to topic $j$, and $n^{(d_i)}_{-i,.}$ is the total number of words in document $d_i$.
Note that all counts exclude the current assignment, as denoted by the $-i$ subscript.
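The update above can be sketched as a minimal collapsed Gibbs sampler on a toy corpus. This is an illustrative sketch, not the authors' implementation; the corpus, hyperparameters, and iteration count are arbitrary assumptions. Note how each token's topic is resampled using only counts, with no distance measure anywhere:

```python
import numpy as np

# Toy corpus: each document is a list of word ids (vocabulary indices).
docs = [[0, 1, 2, 0], [0, 3, 4, 3], [3, 4, 4, 1]]
W = 5            # vocabulary size
T = 2            # number of topics
alpha, beta = 0.1, 0.01
rng = np.random.default_rng(0)

# Count matrices: n_wt[w, j] = times word w is assigned to topic j,
# n_dt[d, j] = times topic j appears in document d.
n_wt = np.zeros((W, T))
n_dt = np.zeros((len(docs), T))
z = []  # topic assignment for every token

# Random initialization of topic assignments.
for d, doc in enumerate(docs):
    z_d = []
    for w in doc:
        j = rng.integers(T)
        z_d.append(j)
        n_wt[w, j] += 1
        n_dt[d, j] += 1
    z.append(z_d)

# Gibbs sweeps: resample each token's topic from the conditional
# P(z_i = j | z_-i, w) given by the formula above.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            j = z[d][i]
            # Remove the current assignment (the "-i" counts).
            n_wt[w, j] -= 1
            n_dt[d, j] -= 1
            # Two factors: word likelihood under each topic, and topic
            # prevalence in the document. The second factor's denominator
            # (n_dt[d].sum() + T*alpha) is constant in j, so it cancels
            # when we normalize.
            p = (n_wt[w] + beta) / (n_wt.sum(axis=0) + W * beta) \
                * (n_dt[d] + alpha)
            j = rng.choice(T, p=p / p.sum())
            z[d][i] = j
            n_wt[w, j] += 1
            n_dt[d, j] += 1
```

After enough sweeps, `n_wt` (normalized by column) estimates each topic's word distribution and `n_dt` each document's topic mixture, purely from co-occurrence counts.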
Why does LDA work?
Referring to these Video Lectures, David Blei attributes it to the following: