Solved – latent Dirichlet allocation: complexity and implementation details

expectation-maximization, latent-dirichlet-alloc, optimization, topic-models, variational-bayes

I was confused by how LDA (with the original variational inference) can be implemented so that the number of operations for each document $j$ is $\mathcal{O}(N_j K)$, where $N_j$ is the number of unique words in $j$ and $K$ is the number of topics. In the original paper by Blei et al., inference scans through each individual word occurrence (not each unique word) to update the variational parameter $\phi_{jik}$, where $i$ indexes the $i$th word in document $j$.

In an email from the topic modeling mailing list, the author (Prof. Blei) also said that

the complexity of mean-field variational inference is actually
O(NKV). the reason is that, in a document, we need not compute
posterior multinomials for each instance of each term but only once
for each unique term in that document. in the LDA paper, we did not
write things down that way to make the math simpler. (though, we
should have mentioned this speed-up.)

if you'll look at the LDA-C code, which is on my web-site, you'll see
that we only need to iterate through the unique terms of each document
for each iteration of variational inference.

Working through the C code would take me too much time. Can anyone give an intuitive explanation of how to implement variational Bayes for LDA by iterating over each document's unique words?

Best Answer

Thanks to Prof. Dave Blei's reply, I now understand it.

It turns out that this simply comes from an algebraic observation. There is indeed a variational multinomial distribution, with parameter $\phi_{jik}$, for the $i$th word of document $j$ and the $k$th topic/component. The update for $\phi_{jik}$ goes like this: $$\phi_{jik}\propto \exp\Big(\mathbb{E}\big[\log\theta_{jk}\big] + \mathbb{E}\big[\log\beta_{kw_i}\big]\Big)$$

where $w_i$ is the vocabulary index of the $i$th word. Therefore, the update is identical for any two word positions that carry the same vocabulary term (say, word 1 and word 17 are both "bayes"). In that case, we can collapse them into a single variational factor per unique term, rather than per word position. After doing so, we only need to weight each term's contribution by its count in the document when updating the other parameters, such as $\gamma$ and $\lambda$. There is no need to re-derive the equations.
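As a minimal sketch (not Prof. Blei's LDA-C code), assuming each document is given in bag-of-words form as an array of unique term IDs plus their counts, a per-document E-step that only loops over unique terms might look like this in Python/NumPy. The function name, the symmetric `alpha`, and the convergence check are my own choices, and `log_beta` can be either $\mathbb{E}[\log\beta]$ (as in the online-LDA setting) or the log of a point estimate:

```python
import numpy as np
from scipy.special import digamma

def lda_e_step_document(term_ids, term_counts, log_beta, alpha,
                        max_iter=100, tol=1e-4):
    """Variational E-step for one document, iterating over unique terms only.

    term_ids    : (N_j,) vocabulary indices of the unique terms in the document
    term_counts : (N_j,) number of occurrences of each of those terms
    log_beta    : (K, V) E[log beta_{kv}] (or log of a point estimate of beta)
    alpha       : scalar symmetric Dirichlet hyperparameter for theta
    """
    K = log_beta.shape[0]
    gamma = np.full(K, alpha + term_counts.sum() / K)  # initial gamma_{jk}
    log_beta_doc = log_beta[:, term_ids]               # (K, N_j): unique terms only

    for _ in range(max_iter):
        last_gamma = gamma.copy()
        # phi update: one K-vector per *unique* term, not per word occurrence.
        # The -digamma(sum(gamma)) part of E[log theta_jk] is constant in k,
        # so it cancels when phi is normalized and can be dropped here.
        log_phi = digamma(gamma)[:, None] + log_beta_doc
        log_phi -= log_phi.max(axis=0)                 # numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=0)                         # normalize over topics
        # gamma update: each term's phi is weighted by its count in the document
        gamma = alpha + phi @ term_counts
        if np.mean(np.abs(gamma - last_gamma)) < tol:
            break
    return gamma, phi
```

The same count weighting carries over to the topic-word side: the sufficient statistics for $\lambda$ (or for a point estimate of $\beta$) would be accumulated as something like `sstats[:, term_ids] += phi * term_counts` across documents, so the whole per-document cost stays $\mathcal{O}(N_j K)$ in the number of unique terms.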

Reference: M. D. Hoffman, D. M. Blei, and F. Bach, “Online Learning for Latent Dirichlet Allocation,” Adv. Neural Inf. Process. Syst., vol. 23, pp. 1–9, 2010.