I'm trying to create a topic model with a mixture of multinomials and the EM algorithm. I do not want to use a package. For reference, I'm implementing this in Python with numpy.
Data Sets
I have two data sets:
- A vocabulary — a list of all words that occur in my documents
- A list of tuples in the format
DocumentID WordID WordCount
My Understanding of EM
My understanding of EM is as follows:
- Take a generative model. In our case, take the multinomial model.
- Initialize the parameters of this model to random or arbitrary values.
- Loop:
- E step — using the current parameters, compute how well each topic explains the data (the posterior responsibilities of the latent assignments)
- M step — update the parameters of the model to maximize the likelihood, weighted by those responsibilities
The Multinomial Model
First, I load the data into a $D$-by-$W$ matrix $A$, where $D$ is the number of documents and $W$ is the number of words in my vocabulary. For any $1 \leq d \leq D$ and $1 \leq w \leq W$, $A_{dw}$ represents the word count of the word with ID $w$ in document with ID $d$.
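As a concrete sketch, the matrix $A$ can be built from the (DocumentID, WordID, WordCount) tuples with numpy. The triples below are made up purely for illustration:

```python
import numpy as np

# Made-up (DocumentID, WordID, WordCount) triples; IDs are 1-based as described.
triples = [(1, 1, 2), (1, 3, 1), (2, 2, 4), (2, 3, 1)]

D = max(d for d, _, _ in triples)       # number of documents
W = max(w for _, w, _ in triples)       # vocabulary size

A = np.zeros((D, W))
for d, w, count in triples:
    A[d - 1, w - 1] = count             # shift to 0-based numpy indexing
```

In practice $W$ would come from the vocabulary list rather than the largest observed WordID, since some vocabulary words may never occur in the corpus.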
My generative model is multinomial. So the likelihood for any document $A_{d}$ given my model is:
$$p(A_{d} | \hat{\alpha}^{[j]}) = \frac{(\sum_{w=1}^{W}{A_{dw}})!}{\prod_{w=1}^{W}{A_{dw}!}} \prod_{w=1}^{W}{(\hat{\alpha}_{w}^{[j]})}^{A_{dw}}$$
Where $1 \leq j \leq T$, $T$ is the number of topics, $W$ is the number of words in my vocabulary, and $A_{dw}$ is the word count of word $w$ in document $A_{d}$. Since the multinomial coefficient at the beginning is independent of $\hat{\alpha}$, we can drop it when maximizing. This simplifies our calculations to:
$$p(A_{d} | \hat{\alpha}^{[j]}) \propto \prod_{w=1}^{W}{(\hat{\alpha}_{w}^{[j]})}^{A_{dw}}$$
Since the documents are independent given the parameters, the joint likelihood is:
$$p(A_{1}, A_{2}, \ldots, A_{D}|\hat{\alpha}^{[j]}) = p(A_{1}|\hat{\alpha}^{[j]}) \cdot p(A_{2}|\hat{\alpha}^{[j]}) \cdots p(A_{D}|\hat{\alpha}^{[j]})$$
What I've Done
So, after loading my data into the matrix $A$ described above, I initialize parameter vectors $\hat{\alpha}^{[j]}$ where $1 \leq j \leq T$.
I think I understand how to calculate the joint likelihoods given each cluster:
initialize likelihoods vector of length T to all ones
for each topic j in 1..T:
    for each document:
        multiply likelihoods[j] by p( document | alpha_vector^{[j]} )
This will find the joint likelihood for each $\hat{\alpha}^{[j]}$. This is the E step.
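One practical note: multiplying many per-document likelihoods underflows floating point very quickly, so it is safer to sum log-likelihoods instead. A minimal numpy sketch of the loop above, with made-up toy data and the $\hat{\alpha}^{[j]}$ vectors stacked as rows of one array, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
D, W, T = 4, 6, 2                       # toy sizes, assumed for illustration
A = rng.integers(0, 5, size=(D, W)).astype(float)

# One parameter vector per topic; each row is a distribution over the vocabulary.
alpha = rng.random((T, W))
alpha /= alpha.sum(axis=1, keepdims=True)

# log p(A_d | alpha^{[j]}) = sum_w A_dw * log(alpha^{[j]}_w), for every (d, j).
log_lik = A @ np.log(alpha).T           # shape (D, T)

# Joint log-likelihood over all documents for each topic j.
joint_log_lik = log_lik.sum(axis=0)     # shape (T,)
```

The matrix product replaces both loops at once: row $d$, column $j$ of `log_lik` is the log-likelihood of document $d$ under topic $j$.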
So the basic loop for EM will be:
loop until satisfied:
E step (as described above)
M step
Where I'm Confused
I don't know what to do for the M step. I'm pretty sure I need to calculate
$$p(\hat{\alpha}^{[j]} | A)$$ for each topic, but I'm not sure of two things:
- How can I do this?
- Once I have these values, what do I do with them?
Resources I've Found
- Here is a fantastic, short document which explains document classification with multinomials quite well.
- Here's a short paper which discusses multinomial mixtures and touches on LDA.
Best Answer
The EM algorithm maximizes the log-likelihood:
$\theta^{\ast} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i|\theta) = \arg \max_\theta \sum_{i=1}^{n} \log \big[\sum_{z_i} p(x_i, z_i|\theta)\big]$
where $p(x_i, z_i | \theta)$ is the joint distribution between observed words $x_i$ and unobserved assignments $z_i$, parameterized by $\theta$. In the case of topic models, $\theta = \{\theta_{z|d}, \beta_{w|z}\}$, where $\theta_{z|d} \sim \text{Dir}(\alpha)$ are document-level topic proportions and $\beta_{w|z} \sim \text{Dir}(\eta)$ are corpus-level topics.
E-step: Given $\theta^{(k)}$, we want to find the soft counts $n(z)$ (one set per document $d$) and $n(w, z)$
First, we compute the posterior over the topic assignments:
$p(z|w, \theta^{(k)}) = \frac{p(z, w | \theta^{(k)})}{\sum_z p(z, w | \theta^{(k)})} = \frac{\theta_{z|d} \beta_{w|z}}{\sum_z \theta_{z|d} \beta_{w|z}}$
Given the posterior $p(z|w, \theta^{(k)})$, we can compute the soft counts:
$n(z) = \sum_{i=1}^{n_d} p(z|w_i, \theta^{(k)})$
$n(w,z) = \sum_{d=1}^{D} \sum_{i=1}^{n_d} p(z|w_i, \theta^{(k)}) \delta(w, w_i)$
where $\delta(w, w_i)$ is an indicator which is equal to 1 when $w=w_i$ and 0 otherwise.
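A vectorized numpy sketch of this E-step, assuming $\theta_{z|d}$ is stored as a `(D, T)` array, $\beta_{w|z}$ as a `(T, W)` array, and using the count matrix `A` in place of the sum over individual tokens (a word occurring $A_{dw}$ times contributes $A_{dw}$ identical posterior terms):

```python
import numpy as np

rng = np.random.default_rng(1)
D, W, T = 4, 6, 2                       # toy sizes, assumed for illustration
A = rng.integers(0, 5, size=(D, W)).astype(float)

theta = rng.random((D, T)); theta /= theta.sum(axis=1, keepdims=True)
beta = rng.random((T, W)); beta /= beta.sum(axis=1, keepdims=True)

# Posterior p(z | w, d): numerator theta_{z|d} * beta_{w|z}, normalized over z.
joint = theta[:, :, None] * beta[None, :, :]        # shape (D, T, W)
post = joint / joint.sum(axis=1, keepdims=True)

# Soft counts, weighting each word type by its count:
n_z = np.einsum('dw,dzw->dz', A, post)              # n(z) per document, (D, T)
n_wz = np.einsum('dw,dzw->zw', A, post)             # n(w, z) over documents, (T, W)
```

Because the posterior sums to one over $z$, each document's soft counts `n_z[d]` sum to its total word count, which is what the M-step's $n_d$ normalizer relies on.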
M-step: Given the soft counts $n(z)$ and $n(w, z)$, we want to update $\theta^{(k)}$
$\theta_{z|d}^{(k+1)} = \frac{n(z)}{n_d}$
$\beta_{w|z}^{(k+1)} = \frac{n(w, z)}{\sum_{w \in V} n(w, z)}$
where $n_d$ is the total number of words in document $d$ and $V$ is our vocabulary.
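These updates are a couple of lines in numpy. The soft counts below are random stand-ins just to show the shapes; a real E-step would produce them:

```python
import numpy as np

rng = np.random.default_rng(2)
D, W, T = 4, 6, 2                       # toy sizes, assumed for illustration

# Stand-in soft counts; in the real algorithm these come from the E-step.
n_z = rng.random((D, T)) * 10.0         # n(z), one row per document
n_wz = rng.random((T, W)) * 10.0        # n(w, z)
n_d = n_z.sum(axis=1)                   # total (soft) word count per document

# M-step updates from the answer:
theta = n_z / n_d[:, None]                      # theta_{z|d} = n(z) / n_d
beta = n_wz / n_wz.sum(axis=1, keepdims=True)   # normalize over the vocabulary V
```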
Thus, we can initialize $\theta^{(0)}$ at random and iterate between the E-step and the M-step until convergence. Since the objective is non-convex, the EM algorithm may require several random restarts to avoid poor local optima. Each iteration is guaranteed not to decrease the log-likelihood of the observed words under the topic model, which makes the log-likelihood a useful convergence check.
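Putting the E-step and M-step together, a minimal end-to-end sketch on made-up toy counts could look like the following. The small smoothing constant on $\beta$ is a numerical convenience (not part of the derivation) to avoid taking the log of zero:

```python
import numpy as np

rng = np.random.default_rng(3)
D, W, T = 8, 10, 3                                 # toy sizes, assumed
A = rng.integers(1, 5, size=(D, W)).astype(float)  # made-up count matrix
n_d = A.sum(axis=1)

# Random initialization of the parameters.
theta = rng.random((D, T)); theta /= theta.sum(axis=1, keepdims=True)
beta = rng.random((T, W)); beta /= beta.sum(axis=1, keepdims=True)

prev_ll = -np.inf
for _ in range(200):
    # E-step: posterior over topic assignments, then soft counts.
    joint = theta[:, :, None] * beta[None, :, :]   # (D, T, W)
    mix = joint.sum(axis=1)                        # p(w | d), shape (D, W)
    post = joint / mix[:, None, :]
    n_z = np.einsum('dw,dzw->dz', A, post)
    n_wz = np.einsum('dw,dzw->zw', A, post)

    # M-step: re-estimate parameters from the soft counts.
    theta = n_z / n_d[:, None]
    beta = (n_wz + 1e-12) / (n_wz + 1e-12).sum(axis=1, keepdims=True)

    # Observed-data log-likelihood; it should never (meaningfully) decrease.
    ll = (A * np.log(mix)).sum()
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll
```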