Solved – Running Latent Dirichlet Allocation (LDA) on word counts

count-data, multinomial-distribution, topic-models

I have difficulty understanding the VB implementation lda-c. In particular, it expects as input a bag-of-words representation of documents, in which each distinct word appearing in a document is mapped to its number of occurrences in that document. The generative model, however, seems to be specified in terms of a sequence of words per document, denoted by $w_{d, n} \in \{1,\ldots,V\}$ for the $n$-th word (drawn from a vocabulary of size $V$) in the $d$-th document. Each word is assigned a latent topic variable $z_{d,n} \in \{1, \ldots, K\}$. Given $z_{d,n}$ and per-topic word distributions $\beta_k$, the likelihood is written as
$$
P(w_{d,n} \mid z_{d,n}, \beta) = \beta_{w_{d,n}, z_{d,n}}
$$
This formulation seems to rely on observing the word at each individual position in the document.
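
To make this explicit, the document-level likelihood factorizes over positions (writing $N_d$ for the length of document $d$, notation not used above):
$$
P(w_d \mid z_d, \beta) = \prod_{n=1}^{N_d} \beta_{w_{d,n}, z_{d,n}}.
$$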

So if I instead observe only a histogram of words $n_d$, where $n_{d,j}$ is the number of times word $j$ appears in document $d$, what would the likelihood $P(n_d \mid z_d, \beta)$ look like?

It feels like there should be a way to rewrite $P(w \mid z, \beta)$ as $P(n \mid z, \beta)$, since the word sequence is exchangeable.
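
Concretely, if the per-word assignments $z_{d,n}$ are marginalized out against the document-topic proportions $\theta_d$ (standard LDA notation, not introduced above), every token becomes i.i.d. given $\theta_d$, so only the counts matter. A sketch of the resulting count likelihood, keeping the word-first subscript order $\beta_{j,k}$ used above:
$$
P(n_d \mid \theta_d, \beta)
  = \binom{N_d}{n_{d,1},\ldots,n_{d,V}}
    \prod_{j=1}^{V} \Bigl( \sum_{k=1}^{K} \theta_{d,k}\, \beta_{j,k} \Bigr)^{n_{d,j}},
\qquad N_d = \sum_{j=1}^{V} n_{d,j}.
$$
The multinomial coefficient does not depend on $\theta_d$ or $\beta$, so it drops out of inference.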

Best Answer

If you only have word counts but no actual documents, you can reconstruct a document from the counts by writing out each word as many times as it occurs, in random order. Because the words are exchangeable, the order you choose does not affect the result, and the algorithm then runs without any rewriting of the likelihood.
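
A minimal sketch of that reconstruction (the function name and variables are illustrative, not from lda-c):

```python
import random

def counts_to_sequence(counts, shuffle=True, seed=0):
    """Expand a {word_id: count} histogram into a list of word ids.

    Order is irrelevant to the LDA likelihood (words are exchangeable),
    so any permutation yields the same inference result.
    """
    tokens = [word for word, n in counts.items() for _ in range(n)]
    if shuffle:
        random.Random(seed).shuffle(tokens)
    return tokens

# Example: a document in which word 3 appears twice and word 7 once.
print(counts_to_sequence({3: 2, 7: 1}))  # e.g. [7, 3, 3]
```

The resulting token sequence can then be fed to any sequence-based LDA implementation as-is.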