I'm trying to create a topic model with a mixture of multinomials and the EM algorithm. I do not want to use a package. For reference, I'm implementing this in Python with numpy.
Data Sets
I have two data sets:
- A vocabulary — a list of all words that occur in my documents
- A list of tuples in the format
DocumentID WordID WordCount
My Understanding of EM
My understanding of EM is as follows:
- Take a generative model. In our case, take the multinomial model.
- Initialize the parameters of this model to random or arbitrary values.
- Loop:
- E step — using the current parameters, compute how well each topic explains the data (the posterior responsibilities of the latent assignments)
- M step — update the parameters of the model to maximize the likelihood, weighted by those responsibilities
The Multinomial Model
First, I load the data into a $D$-by-$W$ matrix $A$, where $D$ is the number of documents and $W$ is the number of words in my vocabulary. For any $1 \leq d \leq D$ and $1 \leq w \leq W$, $A_{dw}$ represents the word count of the word with ID $w$ in document with ID $d$.
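As a concrete sketch, the matrix $A$ can be built from the (DocumentID, WordID, WordCount) tuples with numpy. The triples below are made up purely for illustration:

```python
import numpy as np

# Made-up (DocumentID, WordID, WordCount) triples; IDs are 1-based as described.
triples = [(1, 1, 2), (1, 3, 1), (2, 2, 4), (2, 3, 1)]

D = max(d for d, _, _ in triples)       # number of documents
W = max(w for _, w, _ in triples)       # vocabulary size

A = np.zeros((D, W))
for d, w, count in triples:
    A[d - 1, w - 1] = count             # shift to 0-based numpy indexing
```

In practice $W$ would come from the vocabulary list rather than the largest observed WordID, since some vocabulary words may never occur in the corpus.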
My generative model is multinomial. So the likelihood for any document $A_{d}$ given my model is:
$$p(A_{d} | \hat{\alpha}^{[j]}) = \frac{(\sum_{w=1}^{W}{A_{dw}})!}{\prod_{w=1}^{W}{A_{dw}!}} \prod_{w=1}^{W}{(\hat{\alpha}_{w}^{[j]})}^{A_{dw}}$$
Where $1 \leq j \leq T$, $T$ is the number of topics, $W$ is the number of words in my vocabulary, and $A_{dw}$ is the word count of word $w$ in document $A_{d}$. Since the multinomial coefficient at the beginning is independent of $\hat{\alpha}$, we can drop it when maximizing. This simplifies our calculations to:
$$p(A_{d} | \hat{\alpha}^{[j]}) \propto \prod_{w=1}^{W}{(\hat{\alpha}_{w}^{[j]})}^{A_{dw}}$$
Since the documents are independent given the parameters, the joint likelihood is:
$$p(A_{1}, A_{2}, \ldots, A_{D}|\hat{\alpha}^{[j]}) = p(A_{1}|\hat{\alpha}^{[j]}) \cdot p(A_{2}|\hat{\alpha}^{[j]}) \cdots p(A_{D}|\hat{\alpha}^{[j]})$$
What I've Done
So, after loading my data into the matrix $A$ described above, I initialize parameter vectors $\hat{\alpha}^{[j]}$ where $1 \leq j \leq T$.
I think I understand how to calculate the joint likelihoods given each cluster:
initialize likelihoods vector of length T to all ones
for each topic j in 1..T:
    for each document:
        multiply likelihoods[j] by p( document | alpha_vector^{[j]} )
This will find the joint likelihood for each $\hat{\alpha}^{[j]}$. This is the E step.
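One practical note: multiplying many per-document likelihoods underflows floating point very quickly, so it is safer to sum log-likelihoods instead. A minimal numpy sketch of the loop above, with made-up toy data and the $\hat{\alpha}^{[j]}$ vectors stacked as rows of one array, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
D, W, T = 4, 6, 2                       # toy sizes, assumed for illustration
A = rng.integers(0, 5, size=(D, W)).astype(float)

# One parameter vector per topic; each row is a distribution over the vocabulary.
alpha = rng.random((T, W))
alpha /= alpha.sum(axis=1, keepdims=True)

# log p(A_d | alpha^{[j]}) = sum_w A_dw * log(alpha^{[j]}_w), for every (d, j).
log_lik = A @ np.log(alpha).T           # shape (D, T)

# Joint log-likelihood over all documents for each topic j.
joint_log_lik = log_lik.sum(axis=0)     # shape (T,)
```

The matrix product replaces both loops at once: row $d$, column $j$ of `log_lik` is the log-likelihood of document $d$ under topic $j$.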
So the basic loop for EM will be:
loop until satisfied:
E step (as described above)
M step
Where I'm Confused
I don't know what to do for the M step. I'm pretty sure I need to calculate
$$p(\hat{\alpha}^{[j]} | A)$$ for each topic, but I'm not sure of two things:
- How can I do this?
- Once I have these values, what do I do with them?
Resources I've Found
- Here is a fantastic, short document which explains document classification with multinomials quite well.
- Here's a short paper which discusses multinomial mixtures and touches on LDA.
Best Answer
The EM algorithm maximizes the log-likelihood:
$\theta^{\ast} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i|\theta) = \arg \max_\theta \sum_{i=1}^{n} \log \big[\sum_{z_i} p(x_i, z_i|\theta)\big]$
where $p(x_i, z_i | \theta)$ is the joint distribution between observed words $x_i$ and unobserved assignments $z_i$, parameterized by $\theta$. In the case of topic models, $\theta = \{\theta_{z|d}, \beta_{w|z}\}$, where $\theta_{z|d} \sim \text{Dir}(\alpha)$ are document-level topic proportions and $\beta_{w|z} \sim \text{Dir}(\eta)$ are corpus-level topics.
E-step: Given $\theta^{(k)}$, we want to find the soft counts $n(z)$ (one set per document $d$) and $n(w, z)$
First, we compute the posterior over the topic assignments:
$p(z|w, \theta^{(k)}) = \frac{p(z, w | \theta^{(k)})}{\sum_z p(z, w | \theta^{(k)})} = \frac{\theta_{z|d} \beta_{w|z}}{\sum_z \theta_{z|d} \beta_{w|z}}$
Given the posterior $p(z|w, \theta^{(k)})$, we can compute the soft counts:
$n(z) = \sum_{i=1}^{n_d} p(z|w_i, \theta^{(k)})$
$n(w,z) = \sum_{d=1}^{D} \sum_{i=1}^{n_d} p(z|w_i, \theta^{(k)}) \delta(w, w_i)$
where $\delta(w, w_i)$ is an indicator which is equal to 1 when $w=w_i$ and 0 otherwise.
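A vectorized numpy sketch of this E-step, assuming $\theta_{z|d}$ is stored as a `(D, T)` array, $\beta_{w|z}$ as a `(T, W)` array, and using the count matrix `A` in place of the sum over individual tokens (a word occurring $A_{dw}$ times contributes $A_{dw}$ identical posterior terms):

```python
import numpy as np

rng = np.random.default_rng(1)
D, W, T = 4, 6, 2                       # toy sizes, assumed for illustration
A = rng.integers(0, 5, size=(D, W)).astype(float)

theta = rng.random((D, T)); theta /= theta.sum(axis=1, keepdims=True)
beta = rng.random((T, W)); beta /= beta.sum(axis=1, keepdims=True)

# Posterior p(z | w, d): numerator theta_{z|d} * beta_{w|z}, normalized over z.
joint = theta[:, :, None] * beta[None, :, :]        # shape (D, T, W)
post = joint / joint.sum(axis=1, keepdims=True)

# Soft counts, weighting each word type by its count:
n_z = np.einsum('dw,dzw->dz', A, post)              # n(z) per document, (D, T)
n_wz = np.einsum('dw,dzw->zw', A, post)             # n(w, z) over documents, (T, W)
```

Because the posterior sums to one over $z$, each document's soft counts `n_z[d]` sum to its total word count, which is what the M-step's $n_d$ normalizer relies on.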
M-step: Given the soft counts $n(z)$ and $n(w, z)$, we want to update $\theta^{(k)}$
$\theta_{z|d}^{(k+1)} = \frac{n(z)}{n_d}$
$\beta_{w|z}^{(k+1)} = \frac{n(w, z)}{\sum_{w \in V} n(w, z)}$
where $n_d$ is the total number of words in document $d$ and $V$ is our vocabulary.
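These updates are a couple of lines in numpy. The soft counts below are random stand-ins just to show the shapes; a real E-step would produce them:

```python
import numpy as np

rng = np.random.default_rng(2)
D, W, T = 4, 6, 2                       # toy sizes, assumed for illustration

# Stand-in soft counts; in the real algorithm these come from the E-step.
n_z = rng.random((D, T)) * 10.0         # n(z), one row per document
n_wz = rng.random((T, W)) * 10.0        # n(w, z)
n_d = n_z.sum(axis=1)                   # total (soft) word count per document

# M-step updates from the answer:
theta = n_z / n_d[:, None]                      # theta_{z|d} = n(z) / n_d
beta = n_wz / n_wz.sum(axis=1, keepdims=True)   # normalize over the vocabulary V
```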
Thus, we can initialize $\theta^{(0)}$ at random and iterate between the E-step and the M-step until convergence. Since the objective is non-convex, the EM algorithm may require several random restarts to avoid poor local optima. Each iteration is guaranteed not to decrease the log-likelihood of the observed words under the topic model, which makes the log-likelihood a useful convergence check.
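Putting the E-step and M-step together, a minimal end-to-end sketch on made-up toy counts could look like the following. The small smoothing constant on $\beta$ is a numerical convenience (not part of the derivation) to avoid taking the log of zero:

```python
import numpy as np

rng = np.random.default_rng(3)
D, W, T = 8, 10, 3                                 # toy sizes, assumed
A = rng.integers(1, 5, size=(D, W)).astype(float)  # made-up count matrix
n_d = A.sum(axis=1)

# Random initialization of the parameters.
theta = rng.random((D, T)); theta /= theta.sum(axis=1, keepdims=True)
beta = rng.random((T, W)); beta /= beta.sum(axis=1, keepdims=True)

prev_ll = -np.inf
for _ in range(200):
    # E-step: posterior over topic assignments, then soft counts.
    joint = theta[:, :, None] * beta[None, :, :]   # (D, T, W)
    mix = joint.sum(axis=1)                        # p(w | d), shape (D, W)
    post = joint / mix[:, None, :]
    n_z = np.einsum('dw,dzw->dz', A, post)
    n_wz = np.einsum('dw,dzw->zw', A, post)

    # M-step: re-estimate parameters from the soft counts.
    theta = n_z / n_d[:, None]
    beta = (n_wz + 1e-12) / (n_wz + 1e-12).sum(axis=1, keepdims=True)

    # Observed-data log-likelihood; it should never (meaningfully) decrease.
    ll = (A * np.log(mix)).sum()
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll
```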