Solved – Given an LDA model, how can I calculate p(word|topic,party), where each document belongs to a party

conditional-probability, latent-dirichlet-alloc, maximum-likelihood, natural-language, topic-models

I have an LDA (latent Dirichlet allocation) model trained over a corpus of documents, where each document is associated with a political party. I'd like to arrive at $p(w|z,party)$ for each word $w$, topic $z$, and party $party$.

From the output of LDA, I have a distribution over topics for each document (i.e. $\theta$) and a distribution over words for each topic (i.e. $\phi$).
To combine documents into parties, I'm simply averaging the document-topic distributions for each party's documents. This gives me a distribution over topics for each party, $\theta'$.
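The averaging step above is mechanically simple; a minimal numpy sketch with a made-up $\theta$ matrix and party labels (the variable names and toy numbers here are illustrative, not from the question):

```python
import numpy as np

# Hypothetical inputs: theta is the (n_docs, n_topics) document-topic
# matrix from a fitted LDA model; party_labels assigns each doc a party.
theta = np.array([
    [0.7, 0.2, 0.1],   # doc 0
    [0.5, 0.3, 0.2],   # doc 1
    [0.1, 0.1, 0.8],   # doc 2
])
party_labels = np.array(["A", "A", "B"])

# Average the document-topic rows within each party to get theta'.
parties = np.unique(party_labels)
theta_party = np.vstack([
    theta[party_labels == p].mean(axis=0) for p in parties
])

print(theta_party)
# Each row of theta' still sums to 1, since it is a mean of rows
# that each lie on the probability simplex.
```

Note that this is an unweighted average over documents; one could also weight by document length, which would change the resulting $\theta'$.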

First of all, is this a valid thing to do?

Second of all, I now need to get to $p(w|z,party)$ using $\theta'$ and $\phi$. This answer seems to suggest that I can simply do the following:

$p(w|z,party) = \large\frac{\theta'_{party,z}\phi_{z,w}}{\sum_{v \in W}\theta'_{party,z}\phi_{z,v}}$

Is that correct? If so, can someone explain why?

Note: The original paper for LDA uses $\beta$, not $\phi$, for the topic-word distributions. I am using $\phi$ for consistency with the linked question.
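For concreteness, the proposed formula is easy to evaluate directly; here is a small sketch with hypothetical $\theta'$ and $\phi$ matrices (the numbers are made up purely to show the mechanics):

```python
import numpy as np

# Hypothetical theta' (party-by-topic) and phi (topic-by-word) matrices.
theta_party = np.array([
    [0.6, 0.4],        # party 0 over 2 topics
    [0.2, 0.8],        # party 1
])
phi = np.array([
    [0.5, 0.3, 0.2],   # topic 0 over 3 words
    [0.1, 0.1, 0.8],   # topic 1
])

# p[party, z, w] = theta'[party, z] * phi[z, w]
#                  / sum_v theta'[party, z] * phi[z, v]
numer = theta_party[:, :, None] * phi[None, :, :]
p = numer / numer.sum(axis=2, keepdims=True)

# Observe: theta'[party, z] is constant in the sum over v, so it cancels,
# and each slice p[party, z, :] equals phi[z, :] (phi's rows sum to 1).
print(np.allclose(p, np.broadcast_to(phi, p.shape)))  # True
```

As the final check shows, with this exact formula the $\theta'_{party,z}$ factor cancels between numerator and denominator, so the result does not actually depend on the party, which may be worth keeping in mind when judging whether the formula does what is intended.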

Best Answer

You may want to consider using a structural topic model: http://www.structuraltopicmodel.com/. It's an extension of the correlated topic model (CTM) of Blei & Lafferty (2007) that allows for covariates and metadata, such as party in your example. The stm package in R is fantastic and very easy to use, IMO. There are several published articles showing examples of structural topic models --- see the list of references on the stm website, linked above.
