Solved – Hierarchical Dirichlet Processes in topic modeling

clustering, dirichlet-distribution, topic-models

I think I understand the main ideas of hierarchical Dirichlet processes, but I don't understand the specifics of their application to topic modeling. Basically, the idea is that we have the following model:

$$G_{0}\sim DP(\gamma, H)$$
$$G_{j}\sim DP(\alpha_{0}, G_{0})$$
$$\phi_{ji} \sim G_{j}$$
$$x_{ji} \sim F(\phi_{ji})$$

We sample from a Dirichlet process with base distribution $H$ to obtain a discrete distribution $G_{0}$. Then, we use $G_{0}$ as the base distribution of another Dirichlet process to draw $G_{j}$ for every $j$ (in topic modeling, $j$ indexes documents and $G_{j}$ is a distribution over topics for document $j$). After this, for each word $i$ in document $j$, we sample $\phi_{ji}$ from $G_{j}$ in order to select a particular topic. Some sources say that $\phi_{ji}$ is the parameter associated with the topic rather than the topic itself; in any case, it acts as a latent variable. Finally, for each document $j$ and word $i$, $x_{ji}$ is drawn from a distribution $F(\phi_{ji})$ that depends on the latent variable $\phi_{ji}$, which is associated in some way with the selected topic.

The question is: How do you describe $F(\phi_{ji})$ explicitly? I think I have seen a multinomial distribution there, but I'm not sure about it. As a comparison, in LDA we need a distribution over words for each topic, and a multinomial distribution is used. What is the equivalent procedure here, and what does it represent in terms of words, documents and topics?

Best Answer

I found this truly excellent review that describes precisely how Hierarchical Dirichlet Processes work.

First, start by choosing a base distribution $H$. In the case of topic modeling, $H$ is taken to be a Dirichlet distribution, so that a draw from $H$ is a distribution over words for a topic; its dimension is therefore the size of the vocabulary $V$. In the example described in the review, the author assumes a vocabulary of 10 words and uses $H = \text{Dirichlet}(1/10,\ldots,1/10)$. As usual, a realization of this distribution is a 10-dimensional vector of proportions $\theta_{k}$, i.e. the word distribution of topic $k$.
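As a concrete illustration (not from the review itself), here is a minimal NumPy sketch of one draw from that base distribution; the seed and variable names are just for the example:

```python
import numpy as np

V = 10                                # vocabulary size used in the review's example
alpha_H = np.full(V, 1.0 / V)         # parameters of H = Dirichlet(1/10, ..., 1/10)

rng = np.random.default_rng(0)
theta_k = rng.dirichlet(alpha_H)      # one realization of H: a topic's word distribution
print(theta_k, theta_k.sum())         # 10 proportions that sum to 1
```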

After this, $H$ is used as the base distribution of a Dirichlet process $DP(\gamma, H)$, and a realization $G_{0}$ of this process is another discrete distribution with atoms $\{\theta_{k}\}$, where each $\theta_{k}$ describes the distribution over words for a topic $k$. If we then use $G_{0}$ as the base distribution of another Dirichlet process $DP(\alpha_{0}, G_{0})$, we obtain a realization $G_{j}$ for every document $j$ such that $G_{j}$ has the same support as $G_{0}$. Therefore, every $G_{j}$ shares the same set of $\theta_{k}$'s, although with different proportions (the mixing weights in the definition of a Dirichlet process).
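To make the shared support concrete, here is a hedged sketch that truncates the process to a finite number of atoms $K$ (a simplification, not the full nonparametric construction). It relies on the fact that, when the base measure $G_{0}$ is discrete with finitely many atoms and weights $\beta_{k}$, the weights that $DP(\alpha_{0}, G_{0})$ puts on those atoms are $\text{Dirichlet}(\alpha_{0}\beta_{1},\ldots,\alpha_{0}\beta_{K})$ distributed; the truncation level, seed and variable names are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 10, 15                   # vocabulary size; truncation level for the illustration
gamma, alpha0 = 1.0, 1.0        # concentration parameters

# Atoms of G_0: K topic-word distributions, each drawn from H = Dirichlet(1/V, ..., 1/V).
theta = rng.dirichlet(np.full(V, 1.0 / V), size=K)           # shape (K, V)

# Mixing weights of G_0 via a truncated stick-breaking construction.
sticks = rng.beta(1.0, gamma, size=K)
beta = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
beta /= beta.sum()              # renormalize because of the truncation

# Each document's G_j keeps exactly the same atoms theta_k; only the weights change.
n_docs = 3
pi = rng.dirichlet(alpha0 * beta, size=n_docs)                # shape (n_docs, K)
print(pi.round(2))              # three different weight vectors over the same topics
```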

Finally, for every document $j$ and every word $i$, we draw a realization from $G_{j}$, which picks out a particular vector $\theta_{k}$. Since this $\theta_{k}$ is a distribution over words for a given topic, we only need to sample from a multinomial distribution with parameter $\theta_{k}$ in order to generate the word: $w_{ji} \sim \text{Multinomial}(\theta_{k})$.
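So, in code, $F(\phi_{ji})$ is nothing more than a single categorical/multinomial draw over the $V$ vocabulary entries. A minimal sketch (here $\theta_{k}$ is drawn directly from $H$ just to have something to condition on; in the model it would be the atom selected from $G_{j}$):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10
theta_k = rng.dirichlet(np.full(V, 1.0 / V))   # stand-in for the topic picked from G_j

# F(phi_ji): one multinomial (categorical) draw with parameter theta_k.
w_ji = rng.choice(V, p=theta_k)                # index of the sampled vocabulary word
print(w_ji)
```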

I have seen that sometimes $\phi_{ji}$ is defined directly as $\phi_{ji}=\theta_{k}$ for each document $j$ and word $i$. Sometimes it is easier to introduce an indicator variable $z_{ji}$, sampled from the probabilities $\pi_{jk}$ of $G_{j}$ (in $G_{j} = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\theta_{k}}$), which indexes the selected topic, so that the word is drawn from $\text{Multinomial}(\theta_{z_{ji}})$. However, I think this is done in the context of the stick-breaking construction.
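Continuing the truncated sketch from above (it reuses `rng`, `theta`, `pi`, `K` and `V`, so it is again only an illustration under the same assumptions), the indicator form looks like this:

```python
# Indicator form: draw z_ji from the document's weights pi_j, then the word
# from Multinomial(theta[z_ji]).  Reuses rng, theta, pi, K, V from the sketch above.
doc_len, j = 8, 0
z = rng.choice(K, size=doc_len, p=pi[j])                     # topic index z_ji per word
words = [int(rng.choice(V, p=theta[k])) for k in z]          # w_ji ~ Multinomial(theta_{z_ji})
print(list(zip(z.tolist(), words)))
```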
