Solved – How to create a topic model with a mixture of multinomials and EM

expectation-maximization, mixture-distribution, multinomial-distribution

I'm trying to create a topic model with a mixture of multinomials and the EM algorithm. I do not want to use a package. For reference, I'm implementing this in Python with numpy.

Data Sets

I have two data sets:

  1. A vocabulary — a list of all words that occur in my documents
  2. A list of tuples in the format DocumentID WordID WordCount

My Understanding of EM

My understanding of EM is as follows:

  1. Take a generative model. In our case, take the multinomial model.
  2. Initialize the parameters of this model to random or arbitrary values.
  3. Loop:
    1. E step — calculate the likelihood that our model explains the data
    2. M step — update the parameters of our model to maximize the likelihood

The Multinomial Model

First, I load the data into a $D$-by-$W$ matrix $A$, where $D$ is the number of documents and $W$ is the number of words in my vocabulary. For any $1 \leq d \leq D$ and $1 \leq w \leq W$, $A_{dw}$ represents the word count of the word with ID $w$ in document with ID $d$.
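For concreteness, a minimal numpy sketch of this loading step might look like the following. I assume the (DocumentID, WordID, WordCount) tuples use 1-based IDs as in the data description; the function and variable names are illustrative, not part of the original data format.

    import numpy as np

    def build_count_matrix(triples, D, W):
        """Build the D-by-W matrix A from (doc_id, word_id, count) tuples."""
        A = np.zeros((D, W))
        for doc_id, word_id, count in triples:
            # Shift the 1-based IDs from the data set to 0-based numpy indices.
            A[doc_id - 1, word_id - 1] = count
        return A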

My generative model is multinomial. So the likelihood for any document $A_{d}$ given my model is:

$$p(A_{d} \mid \hat{\alpha}^{[j]}) = \frac{\left(\sum_{w=1}^{W}{A_{dw}}\right)!}{\prod_{w=1}^{W}{A_{dw}!}} \prod_{w=1}^{W}{\left(\hat{\alpha}_{w}^{[j]}\right)}^{A_{dw}}$$

Where $1 \leq j \leq T$, $T$ is the number of topics, $W$ is the number of words in my vocabulary, and $A_{dw}$ is the word count of word $w$ in document $A_{d}$. Since the big factorial term at the beginning is independent of $\hat{\alpha}^{[j]}$, we don't have to worry about it. This simplifies our calculations to:

$$p(A_{d} \mid \hat{\alpha}^{[j]}) = \prod_{w=1}^{W}{\left(\hat{\alpha}_{w}^{[j]}\right)}^{A_{dw}}$$

Since we have several documents, the joint likelihood is:

$$p(A_{1}, A_{2}, \dots, A_{D} \mid \hat{\alpha}^{[j]}) = p(A_{1} \mid \hat{\alpha}^{[j]}) \cdot p(A_{2} \mid \hat{\alpha}^{[j]}) \cdots p(A_{D} \mid \hat{\alpha}^{[j]})$$

What I've Done

So, after loading my data into the matrix $A$ described above, I initialize parameter vectors $\hat{\alpha}^{[j]}$ where $1 \leq j \leq T$.
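As a sketch of that initialization (one choice among many, not the only valid one), each $\hat{\alpha}^{[j]}$ can be drawn at random and normalized so that it is a proper distribution over the $W$ vocabulary words, e.g. from a symmetric Dirichlet. Here `T` and `W` are assumed to be defined as above.

    import numpy as np

    rng = np.random.default_rng(0)

    # T parameter vectors, each a distribution over the W vocabulary words.
    # Rows of `alpha` sum to 1; alpha[j] plays the role of alpha-hat^[j].
    alpha = rng.dirichlet(np.ones(W), size=T)   # shape (T, W)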

I think I understand how to calculate the joint likelihoods given each cluster:

    initialize likelihoods vector of length T to all ones
    for each likelihood in likelihoods vector:
        for each document:
            multiply likelihood by p( document | alpha_vector )

This will find the joint likelihood for each $\hat{\alpha}^{[j]}$. This is the E step.
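A minimal numpy version of this loop, written in log space since the product of many per-document likelihoods quickly underflows to zero (`A` and `alpha` are the matrices defined above; this is a sketch of the loop, not a comment on whether it is the right E step):

    import numpy as np

    # log p(A_d | alpha^[j]) for every document d and topic j, dropping the
    # multinomial coefficient as in the simplified formula above.
    log_doc_likelihoods = A @ np.log(alpha).T                  # shape (D, T)

    # Joint log-likelihood of all documents under each alpha^[j].
    joint_log_likelihoods = log_doc_likelihoods.sum(axis=0)    # shape (T,)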

So the basic loop for EM will be:

    loop until satisfied:
        E step (as described above)
        M step

Where I'm Confused

I don't know what to do for the M step. I'm pretty sure I need to calculate $p(\hat{\alpha}^{[j]} \mid A)$ for each topic, but I'm not sure of two things:

  1. How can I do this?
  2. Once I have these values, what do I do with them?

Resources I've Found

Best Answer

The EM algorithm maximizes the log-likelihood:

$\theta^{\ast} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i|\theta) = \arg \max_\theta \sum_{i=1}^{n} \log \big[\sum_{z_i} p(x_i, z_i|\theta)\big]$

where $p(x_i, z_i \mid \theta)$ is the joint distribution between observed words $x_i$ and unobserved topic assignments $z_i$, parameterized by $\theta$. In the case of topic models, $\theta = \{\theta_d, \beta_{w|z}\}$, where $\theta_d \sim \text{Dir}(\alpha)$ are document-level topic proportions and $\beta_{w|z} \sim \text{Dir}(\eta)$ are corpus-level topics.

E-step: Given $\theta^{(k)}$, we want to find the soft counts $n(z)$ and $n(w, z)$

First, we compute the posterior over the topic assignments:

$p(z|w, \theta^{(k)}) = \frac{p(z, w | \theta^{(k)})}{\sum_z p(z, w | \theta^{(k)})} = \frac{\theta_{z|d} \beta_{w|z}}{\sum_z \theta_{z|d} \beta_{w|z}}$

Given the posterior $p(z|w, \theta^{(k)})$, we can compute the soft counts:

$n(z) = \sum_{i=1}^{n_d} p(z|w_i, \theta^{(k)})$

$n(w,z) = \sum_{d=1}^{D} \sum_{i=1}^{n_d} p(z|w_i, \theta^{(k)}) \delta(w, w_i)$

where $\delta(w, w_i)$ is an indicator which is equal to 1 when $w=w_i$ and 0 otherwise.
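As a minimal numpy sketch of this E-step (the naming is mine: `theta` holds the $\theta_{z|d}$ as a $D \times T$ matrix, `beta` holds the $\beta_{w|z}$ as a $T \times W$ matrix, and `A` is the $D \times W$ count matrix from the question):

    import numpy as np

    def e_step(A, theta, beta):
        """Posterior over topic assignments and the resulting soft counts."""
        # p(z | d, w) for every document/word pair, shape (D, T, W).
        joint = theta[:, :, None] * beta[None, :, :]
        post = joint / joint.sum(axis=1, keepdims=True)

        # n_d(z): expected number of words in document d assigned to topic z, (D, T).
        n_dz = (post * A[:, None, :]).sum(axis=2)
        # n(w, z): expected count of word w under topic z over the corpus, (T, W).
        n_wz = (post * A[:, None, :]).sum(axis=0)
        return n_dz, n_wz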

M-step: Given the soft counts $n(z)$ and $n(w, z)$, we want to update $\theta^{(k)}$

$\theta_{z|d}^{(k+1)} = \frac{n(z)}{n_d}$

$\beta_{w|z}^{(k+1)} = \frac{n(w, z)}{\sum_{w \in V} n(w, z)}$

where $n_d$ is the total number of words in document $d$ and $V$ is our vocabulary.
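The corresponding M-step, continuing the same sketch and notation:

    def m_step(A, n_dz, n_wz):
        """Update theta_{z|d} and beta_{w|z} from the soft counts."""
        n_d = A.sum(axis=1, keepdims=True)             # words per document, (D, 1)
        theta = n_dz / n_d                             # theta^{(k+1)}, shape (D, T)
        beta = n_wz / n_wz.sum(axis=1, keepdims=True)  # beta^{(k+1)},  shape (T, W)
        return theta, beta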

Thus, we can initialize $\theta^{(0)}$ at random and iterate between the E-step and the M-step until convergence. Since EM only finds a local optimum, it may require several random restarts. Each iteration is guaranteed not to decrease the log-likelihood of the observed words under the topic model.
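Putting the two steps together, a possible driver loop (again a sketch under the assumptions and helper functions above, monitoring the observed-data log-likelihood for convergence) might look like:

    import numpy as np

    def fit_em(A, T, n_iter=100, tol=1e-6, seed=0):
        """Run EM until the log-likelihood improvement falls below tol."""
        rng = np.random.default_rng(seed)
        D, W = A.shape
        theta = rng.dirichlet(np.ones(T), size=D)   # (D, T) document-topic proportions
        beta = rng.dirichlet(np.ones(W), size=T)    # (T, W) topic-word probabilities
        prev_ll = -np.inf
        for _ in range(n_iter):
            n_dz, n_wz = e_step(A, theta, beta)
            theta, beta = m_step(A, n_dz, n_wz)
            # Log-likelihood of the observed words (up to the multinomial coefficient).
            # Assumes every vocabulary word occurs somewhere, so theta @ beta stays positive.
            ll = (A * np.log(theta @ beta)).sum()
            if ll - prev_ll < tol:
                break
            prev_ll = ll
        return theta, beta

Running this with a few different seeds and keeping the fit with the highest final log-likelihood is one simple way to implement the restarts mentioned above.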
