LDA is a probabilistic graphical model of a document-generating process, as explained in Blei et al., JMLR 2003 (for more intuition on the generative process see http://videolectures.net/mlss09uk_blei_tm/). The main idea is that if we knew the parameters, we could generate documents; that is the sense in which we have modeled a document. The problem is that we do not know the parameters, so we use Bayes' rule to invert the generative process: we model the uncertainty in the parameters given the data. That is, we need to infer the unknowns given the data.
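Concretely, the generative process for a single document in Blei et al. 2003 (in the paper's notation: $\alpha$ the Dirichlet prior, $\beta$ the per-topic word distributions, $\theta$ the document's topic proportions, $z_n$ the topic assignment of word $n$, and $w_n$ the observed word) is
$$
\theta \sim \mathrm{Dir}(\alpha), \qquad z_n \mid \theta \sim \mathrm{Multinomial}(\theta), \qquad w_n \mid z_n, \beta \sim \mathrm{Multinomial}(\beta_{z_n}), \qquad n = 1, \dots, N.
$$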
In both cases the model is the same; only the way in which we infer is different. The method used by Blei et al. is called variational inference, and the one used by Griffiths is sampling-based inference. Both are approximate inference methods, one in the MCMC class (Griffiths) and the other in the variational class (Blei).
In variational inference, say we have a complex multimodal distribution under which we cannot do inference. What we want is to approximate that complex multimodal distribution with a simpler distribution (called Q in the literature, see (3)). We do this by choosing a simpler family of distributions: either by explicitly choosing a simpler parametric family, as in LDA by Blei 2003, or by only deciding the factorized form, as in Bishop's Gaussian example in (5) (there the parametric form is not chosen explicitly, which is why this is also called free-form optimization).
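For LDA specifically, the variational family in Blei et al. 2003 is a fully factorized (mean-field) distribution over the hidden variables of one document, and fitting it means minimizing the reverse KL divergence to the true posterior (in the paper's notation, with variational parameters $\gamma$ and $\phi_n$):
$$
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n), \qquad (\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} \mathrm{KL}\!\left( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\Vert\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \right).
$$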
Another important difference from Gibbs sampling is that the simpler distribution(1) locks on to one of the modes(2) of the complex distribution we could not handle, whereas in Gibbs sampling we visit modes all over the distribution.
Compared to variational inference, Gibbs sampling is easier to describe. In Gibbs sampling, we write the statistic we are interested in as an integral (probabilities can also be expressed as integrals). We then form a Monte Carlo approximation to that integral using samples from the distribution. Here, too, it is difficult to sample from the complex distribution directly, so we turn to a familiar distribution from which we are capable of sampling. There are many hacks and improvements to this basic description.
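As a sketch of the basic Monte Carlo idea: to estimate the expectation of a statistic $f$ under a posterior $p(\theta \mid \mathbf{w})$, replace the integral with an average over $S$ samples drawn (e.g., by Gibbs sampling) from that posterior:
$$
\mathbb{E}_{p(\theta \mid \mathbf{w})}[f(\theta)] = \int f(\theta)\, p(\theta \mid \mathbf{w})\, d\theta \;\approx\; \frac{1}{S} \sum_{s=1}^{S} f\big(\theta^{(s)}\big), \qquad \theta^{(s)} \sim p(\theta \mid \mathbf{w}).
$$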
For more details see (3) and (4).
(2) This comes from the zero-forcing nature of the reverse KL divergence used as the cost. Understanding zero forcing is subtle and nice. Suppose you want to lock an unnormalized unimodal distribution onto one of the modes of a complex multimodal distribution. Now we need some imagination. What happens with zero forcing is that wherever the complex distribution is zero, the simpler distribution is forced to be zero as well; and because the simpler distribution is usually chosen to be unimodal, it has no choice but to slip into one of the modes (the zero-forcing effect). If you think in terms of the unnormalized distribution (because we are interested in the parameters), it is a nice picture: the unimodal distribution slips into one of the modes of the multimodal one.
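In symbols, the zero-forcing argument in (2) is just that the reverse KL divergence
$$
\mathrm{KL}(q \,\Vert\, p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx
$$
blows up whenever $q$ puts mass where $p$ has none, so the minimizing $q$ must satisfy $q(x) = 0$ wherever $p(x) = 0$.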
(1) The simpler distribution, which is unimodal in the case of the mean-field approximation.
(3) Kevin Murphy, Machine Learning: A Probabilistic Perspective.
Chapter 21: Variational Inference
Chapter 22: More Variational Inference
Chapter 23: Monte Carlo Inference
Chapter 24: Markov Chain Monte Carlo Inference
(4) Martin Wainwright and Michael Jordan, Graphical Models, Exponential Families, and Variational Inference.
(5) Christopher Bishop, Pattern Recognition and Machine Learning.
How to classify documents
Importantly, latent Dirichlet allocation is an unsupervised method: On its own, it doesn't account for the class or category of a document. But, as discussed in section 7.2 of the paper that introduced it, it can be used to develop features for classification:
A challenging aspect of the document classification problem is the choice of features. Treating individual words as features yields a rich but very large feature set (Joachims, 1999). One way to reduce this feature set is to use an LDA model for dimensionality reduction. In particular, LDA reduces any document to a fixed set of real-valued features—the posterior Dirichlet parameters $\gamma^*(\mathbf{w})$ associated with the document.
So as a general, practical answer to your second question: Parameters of the topic distribution for a document can be used as features in a classifier of your choice. That's exactly what the authors of LDA did in their experiments:
In these experiments, we estimated the parameters of an LDA model on all the documents, without reference to their true class label. We then trained a support vector machine (SVM) on the low-dimensional representations provided by LDA and compared this SVM to an SVM trained on all the word features.
Here's an example of what this could look like in Python. It transforms the digits dataset from sklearn to a 16-topic space, then predicts the digit using logistic regression. (Sixteen was chosen rather arbitrarily after some exploration in my answer here.)
# -*- coding: utf-8 -*-
"""
Created on Fri May 27 15:24:16 2016
@author: SeanEaster
"""
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn.metrics import confusion_matrix

import numpy as np

digits = load_digits()
images = digits['images']

# Flatten each 8x8 image into a row of 64 non-negative pixel intensities
images = [image.reshape((1, -1)) for image in images]
images = np.concatenate(tuple(images), axis=0)

# Reduce the 64 pixel features to 16 topic proportions
lda = LDA(n_components=16)  # older sklearn versions call this argument n_topics
X = lda.fit_transform(images)
Y = digits['target']

xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=.2, random_state=9)

classifier = LogisticRegression(C=1e5)  # Choice of C here is arbitrary; in practice, cross validate
classifier.fit(xTrain, yTrain)  # fit on the training split only, evaluate on the held-out test split
print(confusion_matrix(yTest, classifier.predict(xTest)))
This gives reasonable results. Here's a confusion matrix from one run:
[[33 0 0 0 0 0 0 0 0 0]
[ 0 36 1 0 0 0 0 0 2 1]
[ 0 1 40 2 0 0 0 0 2 0]
[ 0 0 0 32 0 0 0 0 0 2]
[ 0 0 0 0 36 0 0 4 0 1]
[ 0 0 0 1 0 34 0 0 0 4]
[ 0 0 0 0 0 0 29 0 0 0]
[ 0 0 0 0 0 0 0 27 1 0]
[ 0 6 1 0 1 1 0 1 25 1]
[ 0 0 0 1 1 1 0 0 3 29]]
For a text application, see this classification example from the sklearn docs.
Uses for the posterior distributions
To your first question, there are still uses for LDA topics outside of classification, namely that extracted topics can give a descriptive summary of a corpus. Another sklearn example does this on the 20 newsgroups dataset and prints the top words of each topic (a sketch of how such top words can be extracted appears at the end of this section). Here is its output:
Topics in LDA model:
Topic #0:
government people mr law gun state president states public use right rights national new control american security encryption health united
Topic #1:
drive card disk bit scsi use mac memory thanks pc does video hard speed apple problem used data monitor software
Topic #2:
said people armenian armenians turkish did saw went came women killed children turkey told dead didn left started greek war
Topic #3:
year good just time game car team years like think don got new play games ago did season better ll
Topic #4:
10 00 15 25 12 11 20 14 17 16 db 13 18 24 30 19 27 50 21 40
Topic #5:
windows window program version file dos use files available display server using application set edu motif package code ms software
Topic #6:
edu file space com information mail data send available program ftp email entry info list output nasa address anonymous internet
Topic #7:
ax max b8f g9v a86 pl 145 1d9 0t 34u 1t 3t giz bhj wm 2di 75u 2tm bxn 7ey
Topic #8:
god people jesus believe does say think israel christian true life jews did bible don just know world way church
Topic #9:
don know like just think ve want does use good people key time way make problem really work say need
You can already see some intuitive overlap with the newsgroup names, described here, e.g. talk.politics.guns, talk.religion.misc. You can carry this descriptive analysis further, but exactly how depends on your interests.
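As referenced above, here is a minimal sketch of how top words per topic are typically extracted from a fitted sklearn LDA model. The vectorizer settings, topic count, and variable names are illustrative assumptions, not the linked example's actual code:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Bag-of-words counts for the corpus; vocabulary size and topic count are arbitrary choices
docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10)
lda.fit(counts)

# Each row of components_ holds one topic's word weights; its largest entries are that topic's top words
vocab = vectorizer.get_feature_names_out()  # older sklearn versions: get_feature_names()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-10:][::-1]
    print("Topic #%d:" % k, " ".join(vocab[i] for i in top))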
Best Answer
I do not see why you cannot use EM for LDA. To apply EM to LDA: In the E-step, you fix $\theta$ (the topic distribution of the document) and $\phi$ (the word distribution under a topic) and compute the distribution $q(z)=p(z|x,\theta,\phi)$ ($z$ is the topic assignment of each word). In the M-step, you update $\theta$ and $\phi$ to optimize the expected log likelihood, where the expectation is taken based on $q(z)$.
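As a sketch of the two steps just described (with $x$ the observed words and $z$ their topic assignments):
$$
\text{E-step:}\quad q(z) = p(z \mid x, \theta, \phi), \qquad \text{M-step:}\quad (\theta, \phi) \leftarrow \arg\max_{\theta, \phi}\; \mathbb{E}_{q(z)}\big[\log p(x, z \mid \theta, \phi)\big].
$$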
Of course, if you use less-than-one hyperparameters for the Dirichlet distributions, then you cannot use EM, because in the M-step the expected log likelihood would contain Dirichlet terms over $\theta$ and $\phi$ that may have multiple modes (i.e., some of the parameters of the Dirichlet may be less than 1, leading to infinite probability density at the corners/edges of the simplex).
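To see why: the Dirichlet density is proportional to a product of powers, and any component whose parameter is below one has a negative exponent, so the density diverges at the boundary of the simplex:
$$
\mathrm{Dir}(\theta \mid \alpha) \propto \prod_{k} \theta_k^{\alpha_k - 1} \;\to\; \infty \quad \text{as } \theta_k \to 0 \text{ whenever } \alpha_k < 1.
$$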