Solved – Hierarchical Dirichlet Processes in topic modeling

clustering, dirichlet-distribution, topic-models

I think I understand the main ideas of hierarchical Dirichlet processes, but I don't understand the specifics of their application to topic modeling. Basically, the idea is that we have the following model:

$$G_{0}\sim DP(\gamma, H)$$
$$G_{j}\sim DP(\alpha_{0}, G_{0})$$
$$\phi_{ji} \sim G_{j}$$
$$x_{ji} \sim F(\phi_{ji})$$

We sample from a Dirichlet process with base distribution $H$ to obtain a discrete distribution $G_{0}$. Then, we use $G_{0}$ as the base distribution of another Dirichlet process to draw $G_{j}$ for every $j$ (in topic modeling, $j$ indexes documents and $G_{j}$ is a distribution over topics for document $j$). After this, for each word $i$ in document $j$, we sample $\phi_{ji}$ from $G_{j}$ in order to select a particular topic. Some sources say that $\phi_{ji}$ is the parameter associated with the topic rather than the topic itself; in any case, it acts as a latent variable. Finally, for each document $j$ and word $i$, $x_{ji}$ is drawn from a distribution $F(\phi_{ji})$ that depends on the latent variable $\phi_{ji}$, which is associated in some way with the selected topic.

The question is: How do you describe $F(\phi_{ji})$ explicitly? I think I have seen a multinomial distribution there, but I'm not sure about it. As a comparison, in LDA we need a distribution over words for each topic, and a multinomial distribution is used. What is the equivalent procedure here, and what does it represent in terms of words, documents and topics?

Best Answer

I found this truly excellent review that describes precisely how Hierarchical Dirichlet Processes work.

First, start by choosing a base distribution $H$. In the case of topic modeling, $H$ is taken to be a Dirichlet distribution, so that a draw from $H$ is a distribution over words for a topic; its dimension is therefore the size of the vocabulary $V$. In the example described in the review, the author assumes a vocabulary of 10 words and uses $H = \text{Dirichlet}(1/10,\ldots,1/10)$. As usual, a realization of this distribution is a 10-dimensional vector of proportions $\theta_{k}$, i.e. the word distribution of topic $k$.
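As a concrete illustration (not from the review itself), here is a minimal NumPy sketch of one draw from that base distribution; the seed and variable names are just for the example:

```python
import numpy as np

V = 10                                # vocabulary size used in the review's example
alpha_H = np.full(V, 1.0 / V)         # parameters of H = Dirichlet(1/10, ..., 1/10)

rng = np.random.default_rng(0)
theta_k = rng.dirichlet(alpha_H)      # one realization of H: a topic's word distribution
print(theta_k, theta_k.sum())         # 10 proportions that sum to 1
```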

After this, $H$ is used as the base distribution of a Dirichlet process $DP(\gamma, H)$, and a realization $G_{0}$ of this process is another discrete distribution with atoms $\{\theta_{k}\}$, where each $\theta_{k}$ describes the distribution over words for a topic $k$. If we then use $G_{0}$ as the base distribution of another Dirichlet process $DP(\alpha_{0}, G_{0})$, we obtain a realization $G_{j}$ for every document $j$ such that $G_{j}$ has the same support as $G_{0}$. Therefore, every $G_{j}$ shares the same set of $\theta_{k}$'s, although with different proportions (the mixing weights in the definition of a Dirichlet process).
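To make the shared support concrete, here is a hedged sketch that truncates the process to a finite number of atoms $K$ (a simplification, not the full nonparametric construction). It relies on the fact that, when the base measure $G_{0}$ is discrete with finitely many atoms and weights $\beta_{k}$, the weights that $DP(\alpha_{0}, G_{0})$ puts on those atoms are $\text{Dirichlet}(\alpha_{0}\beta_{1},\ldots,\alpha_{0}\beta_{K})$ distributed; the truncation level, seed and variable names are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 10, 15                   # vocabulary size; truncation level for the illustration
gamma, alpha0 = 1.0, 1.0        # concentration parameters

# Atoms of G_0: K topic-word distributions, each drawn from H = Dirichlet(1/V, ..., 1/V).
theta = rng.dirichlet(np.full(V, 1.0 / V), size=K)           # shape (K, V)

# Mixing weights of G_0 via a truncated stick-breaking construction.
sticks = rng.beta(1.0, gamma, size=K)
beta = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
beta /= beta.sum()              # renormalize because of the truncation

# Each document's G_j keeps exactly the same atoms theta_k; only the weights change.
n_docs = 3
pi = rng.dirichlet(alpha0 * beta, size=n_docs)                # shape (n_docs, K)
print(pi.round(2))              # three different weight vectors over the same topics
```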

Finally, for every document $j$ and every word $i$, we draw a realization from $G_{j}$, which picks out a particular vector $\theta_{k}$. Since this $\theta_{k}$ is a distribution over words for a given topic, we only need to sample from a multinomial distribution with parameter $\theta_{k}$ in order to generate the word: $w_{ji} \sim \text{Multinomial}(\theta_{k})$.
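So, in code, $F(\phi_{ji})$ is nothing more than a single categorical/multinomial draw over the $V$ vocabulary entries. A minimal sketch (here $\theta_{k}$ is drawn directly from $H$ just to have something to condition on; in the model it would be the atom selected from $G_{j}$):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10
theta_k = rng.dirichlet(np.full(V, 1.0 / V))   # stand-in for the topic picked from G_j

# F(phi_ji): one multinomial (categorical) draw with parameter theta_k.
w_ji = rng.choice(V, p=theta_k)                # index of the sampled vocabulary word
print(w_ji)
```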

I have seen that sometimes $\phi_{ji}$ is defined directly as $\phi_{ji}=\theta_{k}$ for each document $j$ and word $i$. Sometimes it is easier to introduce an indicator variable $z_{ji}$, sampled from the probabilities $\pi_{jk}$ of $G_{j}$ (in $G_{j} = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\theta_{k}}$), which indexes the selected topic, so that the word is drawn from $\text{Multinomial}(\theta_{z_{ji}})$. However, I think this is done in the context of the stick-breaking construction.
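Continuing the truncated sketch from above (it reuses `rng`, `theta`, `pi`, `K` and `V`, so it is again only an illustration under the same assumptions), the indicator form looks like this:

```python
# Indicator form: draw z_ji from the document's weights pi_j, then the word
# from Multinomial(theta[z_ji]).  Reuses rng, theta, pi, K, V from the sketch above.
doc_len, j = 8, 0
z = rng.choice(K, size=doc_len, p=pi[j])                     # topic index z_ji per word
words = [int(rng.choice(V, p=theta[k])) for k in z]          # w_ji ~ Multinomial(theta_{z_ji})
print(list(zip(z.tolist(), words)))
```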
