Solved – Clustering with Latent Dirichlet Allocation (LDA): Distance Measure

clustering, distance, latent-dirichlet-alloc, similarities

Since a similarity/distance measure is crucial for every clustering algorithm, I wonder what this measure is for LDA.
Since LDA works on text as a bag-of-words model, is the similarity between topics (clusters) based on the representative words of those clusters?

For example:

  • topic 1: [dog, cat, animal]
  • topic 2: [dog, fetch, catch]

If those are the representative words, is the measure for clustering these topics their similarity in vector space?
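To make the idea concrete, here is a toy sketch of what I mean by comparing the topics in vector space (the vocabulary and the 0/1 weights are made up for illustration):

```python
import numpy as np

# Toy illustration: represent each topic as a bag-of-words vector over a
# shared vocabulary and compare the vectors with cosine similarity.
# Vocabulary and weights are invented for this example only.
vocab = ["dog", "cat", "animal", "fetch", "catch"]
topic1 = np.array([1, 1, 1, 0, 0], dtype=float)  # [dog, cat, animal]
topic2 = np.array([1, 0, 0, 1, 1], dtype=float)  # [dog, fetch, catch]

cosine = topic1 @ topic2 / (np.linalg.norm(topic1) * np.linalg.norm(topic2))
print(cosine)  # 1/3, since the topics share only "dog"
```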

Greetings

Best Answer

LDA does not have a distance metric

The intuition behind the LDA topic model is that words belonging to a topic appear together in documents. Unlike typical clustering algorithms such as K-Means, it does not assume any distance measure between topics. Instead, it infers topics purely from word counts, using the bag-of-words representation of documents.

This can be appreciated from the Gibbs sampler described in the paper by Griffiths et al.:

$$ P(z_i=j \mid \textbf{z}_{-i} , \textbf{w} ) \propto \frac{n^{(w_i)}_{-i,j}+\beta}{n^{(.)}_{-i,j}+W\beta} \times \frac{n^{(d_i)}_{-i,j}+\alpha}{n^{(d_i)}_{-i,.}+T\alpha} $$

$P(z_i=j \mid \textbf{z}_{-i} , \textbf{w} )$ refers to the probability of assigning topic $j$ to $i^{th}$ word, given all other assignments. This depends on two probabilities:

  1. Probability of word $w_i$ in topic $j$
  2. Probability of topic $j$ in document $d_i$

These probabilities can be easily computed using the following counts:

  • $n^{(w_i)}_{-i,j}:$ number of times word $w_i$ was assigned to topic $j$
  • $n^{(.)}_{-i,j}:$ total number of words assigned to topic $j$
  • $n^{(d_i)}_{-i,j}:$ number of times topic $j$ was assigned in document $d_i$
  • $n^{(d_i)}_{-i,.}:$ total number of words in document $d_i$ that have a topic assignment
  • $T:$ number of topics
  • $W:$ number of words in vocabulary
  • $\alpha, \beta:$ Dirichlet hyperparameters

Note that all counts exclude the current assignment, as denoted by the $-i$ subscript.
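To make the update concrete, here is a minimal sketch of one collapsed Gibbs sweep built directly from these counts. It is not from the paper or any particular library; all names (`n_wt`, `n_t`, `n_dt`, `n_d`, `z`) are illustrative.

```python
import numpy as np

def gibbs_sweep(docs, z, n_wt, n_t, n_dt, n_d, alpha, beta, T, W, rng):
    """One pass of collapsed Gibbs sampling over all words.

    docs:       list of documents, each a list of word ids
    z:          current topic assignment of each word (same shape as docs)
    n_wt[w, t]: times word w is assigned to topic t
    n_t[t]:     total words assigned to topic t
    n_dt[d, t]: times topic t is assigned in document d
    n_d[d]:     total assigned words in document d
    """
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # Remove the current assignment (the "-i" in the formula).
            n_wt[w, t_old] -= 1
            n_t[t_old] -= 1
            n_dt[d, t_old] -= 1
            n_d[d] -= 1

            # Unnormalized probability of each topic:
            # (word-in-topic term) * (topic-in-document term), as in the equation.
            p = ((n_wt[w, :] + beta) / (n_t + W * beta)) * \
                ((n_dt[d, :] + alpha) / (n_d[d] + T * alpha))
            t_new = rng.choice(T, p=p / p.sum())

            # Add the new assignment back into the counts.
            n_wt[w, t_new] += 1
            n_t[t_new] += 1
            n_dt[d, t_new] += 1
            n_d[d] += 1
            z[d][i] = t_new
    return z
```

In practice you would initialize `z` randomly, build the four count arrays from that initialization, and repeat this sweep until the topic assignments stabilize. Note that nothing in the update compares topics to each other; only word and topic counts appear.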


Why does LDA work?

Referring to his video lectures, David Blei attributes it to the following:

[Slide from David Blei's lecture: "Why LDA works"]
