I am using the Topic Modeling Tool, which is based on Mallet and uses latent Dirichlet allocation (LDA). When I run the tool multiple times on the same input (a folder of 200-500 short text files), with Topics = 10 and the default of 10 words printed per topic, the 10 generated topics and the words in the output change from run to run.
So am I supposed to rerun it multiple times, and if so, how do I pick one of the runs?
Does it matter which topic modeling tool I use? I found a few other tools in my search, including the Stanford Topic Modeling Toolbox, the Natural Language Toolkit, etc.
Best Answer
As Cliff notes in the comments, the objective function for LDA is non-convex, making it a multimodal problem. That is, you can expect any given run to reach a local optimum, but you cannot expect that it will outperform another run started from a different initialization.
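To make the point about local optima concrete, here is a small sketch that has nothing to do with LDA itself: plain gradient descent on a one-dimensional non-convex function that I made up for illustration. The function, step size, and starting points are all my own assumptions. Both runs converge, but to different minima, and only one of them is the better one.

```python
def f(x):
    # Toy non-convex objective with two local minima (chosen for illustration only).
    return x**4 - 3 * x**2 + x

def grad_f(x):
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    # Plain fixed-step gradient descent from the starting point x0.
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

# Same objective, same algorithm, different starting points.
x_left = gradient_descent(-2.0)
x_right = gradient_descent(2.0)

print("start -2.0 -> x = %.4f, f(x) = %.4f" % (x_left, f(x_left)))
print("start +2.0 -> x = %.4f, f(x) = %.4f" % (x_right, f(x_right)))
```

Each run stops at a point where the gradient is zero, so each looks "converged" on its own. Only by comparing the objective values across runs can you tell which one is better, which is the same situation you are in with repeated LDA runs.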
To choose between multiple runs, consider this from another paper on variational methods by LDA creators David Blei and Michael Jordan (emphasis mine):
They were writing about Dirichlet process mixtures, but the principle of selecting the run that best predicts the data carries over to LDA. Also consider selecting the run with the lowest perplexity on a held-out test set.
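As a rough sketch of the held-out perplexity comparison, suppose each run gives you per-topic word probabilities (phi) and topic proportions inferred for the held-out documents (theta). How you get those depends on the tool. The function names, the data layout, and the toy numbers below are all hypothetical and only meant to show the arithmetic: perplexity is the exponential of the negative average log-likelihood per held-out token, and lower is better.

```python
import math

def held_out_perplexity(docs, theta, phi):
    """Perplexity of held-out docs under one run.

    docs:  list of documents, each a list of word ids
    theta: theta[d][k] = proportion of topic k in held-out document d
    phi:   phi[k][w]   = probability of word w under topic k
    """
    log_lik = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Probability of the word is a mixture over topics.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

def select_best_run(runs, docs):
    """Return the name of the run with the lowest held-out perplexity, plus all scores."""
    scores = {name: held_out_perplexity(docs, theta, phi)
              for name, (theta, phi) in runs.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Toy example: 4-word vocabulary, 2 topics, 2 held-out documents (made-up numbers).
held_out = [[0, 0, 1], [2, 3, 3]]
runs = {
    # Run A found two well-separated topics.
    "run_a": ([[0.9, 0.1], [0.1, 0.9]],
              [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]),
    # Run B ended up with uninformative, uniform topics.
    "run_b": ([[0.5, 0.5], [0.5, 0.5]],
              [[0.25] * 4, [0.25] * 4]),
}

best, scores = select_best_run(runs, held_out)
print(best, scores)
```

The held-out documents must not have been used for training in any of the runs, otherwise the comparison rewards overfitting rather than predictive quality.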