Solved – Topic modeling (LDA) gives different outputs

bayesian, machine-learning, topic-models, unsupervised-learning, variational-bayes

I am using the Topic Modeling Tool, which is based on MALLET and uses latent Dirichlet allocation (LDA). When I run the tool multiple times on the same input (a folder of 200-500 short text files), with Topics = 10 and the default of 10 words printed per topic, the generated topics and the words within them change from run to run.

Am I supposed to rerun it multiple times, and if so, how do I pick one of the results?
Does it matter which topic modeling tool I use? My search turned up a few others, including the Stanford Topic Modeling Toolbox and the Natural Language Toolkit.

Best Answer

As Cliff notes in comments, the objective function for LDA is non-convex, which makes it a multimodal problem. That is, any given run can be expected to reach only a local optimum; you cannot expect a run from one random starting point to outperform a run from another.

To choose between multiple runs, consider this from another paper on variational methods by LDA creators David Blei and Michael Jordan (emphasis mine):

Practical applications of variational methods must address initialization of the variational distribution. While the algorithm yields a bound for any starting values of the variational parameters, poor choices of initialization can lead to local maxima that yield poor bounds. We initialize the variational distribution by incrementally updating the parameters according to a random permutation of the data points. (This can be viewed as a variational version of sequential importance sampling). We run the algorithm multiple times and choose the final parameter settings that give the best bound on the marginal likelihood.

They were writing about Dirichlet process mixtures, but the principle carries over: select the run that best predicts the data. You can also select based on the perplexity of a held-out test set.
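The selection procedure can be sketched with scikit-learn's LatentDirichletAllocation (a different implementation than MALLET's, but the same model): train several runs from different random seeds and keep the one with the lowest held-out perplexity. The tiny corpus below is a made-up stand-in for illustration.

```python
# Sketch: run LDA several times from different random initializations and
# keep the run with the lowest perplexity on a held-out split.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus with two obvious themes (pets vs. finance).
docs = [
    "cats dogs pets animals",
    "dogs bark cats meow",
    "pets animals dogs cats",
    "stocks market trading finance",
    "finance market stocks economy",
    "economy trading finance market",
] * 10  # repeat so the variational optimizer has enough data

X = CountVectorizer().fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

best_model, best_perplexity = None, float("inf")
for seed in range(5):  # five runs, each from a different starting point
    lda = LatentDirichletAllocation(n_components=2, random_state=seed)
    lda.fit(X_train)
    pp = lda.perplexity(X_test)  # lower is better
    if pp < best_perplexity:
        best_model, best_perplexity = lda, pp

print(f"best held-out perplexity over 5 runs: {best_perplexity:.1f}")
```

With a real corpus you would compare more runs and more topic counts; the point is only that each seed yields a different local optimum, and held-out perplexity gives a principled way to choose among them.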