Solved – Given a topic distribution over words from an LDA model, how to calculate the distribution over topics for a new document

latent-variable, spark-mllib

I'm using Spark 1.6.2 via the Python API. It seems that, as of this writing, the only data available from the LDA (latent Dirichlet allocation) model is the topic distribution over words, i.e. p(word | topic). What I would like to do is assign a topic to each individual word in a document, i.e. p(topic | word), and calculate a topic distribution for each document, p(topic | document).

Is Gibbs sampling the only way to do this given the data that I have available? If I were to implement Gibbs sampling, could I keep p(word | topic) fixed and just resample p(topic | document) until it converges?
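If you did want to go the Gibbs route with the word-topic matrix held fixed, each token's update only needs the current topic counts of the document itself. Below is a minimal sketch of that idea in plain NumPy; the function name, the symmetric document prior `alpha`, and the iteration counts are illustrative choices, not part of Spark's API.

```python
import numpy as np

def infer_doc_topics(word_ids, topic_word, alpha=0.1, n_iter=200, burn_in=100, seed=0):
    """Collapsed Gibbs sampling for a single new document, keeping the
    topic-word matrix p(word | topic) fixed.

    word_ids   : list of vocabulary indices, one per token in the document
    topic_word : array of shape (n_topics, vocab_size), rows are p(word | topic)
    """
    rng = np.random.RandomState(seed)
    n_topics = topic_word.shape[0]
    # Random initial topic assignment for every token
    z = rng.randint(n_topics, size=len(word_ids))
    counts = np.bincount(z, minlength=n_topics).astype(float)
    samples = np.zeros(n_topics)

    for it in range(n_iter):
        for n, w in enumerate(word_ids):
            counts[z[n]] -= 1                      # remove token n from its topic
            # p(z_n = k | rest) is proportional to (count_k + alpha) * p(w | k)
            p = (counts + alpha) * topic_word[:, w]
            p /= p.sum()
            z[n] = rng.choice(n_topics, p=p)       # resample token n's topic
            counts[z[n]] += 1
        if it >= burn_in:
            samples += counts                      # accumulate post-burn-in counts

    return samples / samples.sum(), z              # estimated p(topic | document), assignments

# Example: 3 topics over a 5-word vocabulary, document of 4 tokens
beta = np.array([[0.50, 0.30, 0.10, 0.05, 0.05],
                 [0.10, 0.10, 0.50, 0.20, 0.10],
                 [0.05, 0.05, 0.10, 0.30, 0.50]])
theta, assignments = infer_doc_topics([0, 1, 2, 4], beta)
```

Since the topic-word matrix never changes, only the per-document counts are resampled, which is exactly the "keep p(word | topic) fixed" scheme asked about.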

Are there alternatives to Gibbs sampling that are a) more direct or straightforward to implement and/or b) faster to execute (I imagine Gibbs sampling isn't the speediest algorithm)?

Best Answer

The online variational Bayes (VB) algorithm for LDA implemented in gensim and scikit-learn, in addition to computing topic distributions for new unseen documents, implicitly computes $q(z_{dw} = k) = \phi_{dwk}$, the probability of assigning word $w$ in document $d$ to topic $k$. It is computed implicitly rather than stored in order to save space (and for online LDA, $d$ ranges over a mini-batch of documents instead of the entire corpus). $\phi_{dwk}$ can be seen as a proxy for what the collapsed Gibbs sampler does when it assigns topic labels to individual words in every document.
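For concreteness, here is a minimal sketch of this in gensim; the toy corpus, `num_topics`, and other parameter values are illustrative. `get_document_topics` with `per_word_topics=True` returns both the document-topic distribution and the per-word quantities referred to above.

```python
from gensim import corpora, models

# Toy training corpus; in practice these would be your tokenized documents.
train_docs = [["apple", "banana", "fruit"],
              ["cpu", "gpu", "memory"],
              ["fruit", "smoothie", "banana"],
              ["memory", "cache", "cpu"]]

dictionary = corpora.Dictionary(train_docs)
corpus = [dictionary.doc2bow(doc) for doc in train_docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=10, random_state=0)

# Infer p(topic | document) for a new, unseen document.
new_bow = dictionary.doc2bow(["banana", "smoothie", "gpu"])
doc_topics, word_topics, phi_values = lda.get_document_topics(
    new_bow, per_word_topics=True)

print(doc_topics)   # [(topic_id, probability), ...]  -> p(topic | document)
print(word_topics)  # [(word_id, [topic_ids]), ...]   -> per-word topic assignments
print(phi_values)   # [(word_id, [(topic_id, phi), ...]), ...] per-word topic weights
```

scikit-learn's `LatentDirichletAllocation` exposes the document-topic part of the same computation through its `transform` method, though it does not return the per-word assignments.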
