Python – Choosing Words in a Topic: Optimal Cut-Off for LDA Topics

python, topic-models

My question is: what is a good cut-off threshold for the words in LDA topics?

I used the code from this blog post: Topic modeling with latent Dirichlet allocation in Python.

In that code, the author shows the top 8 words in each topic, but is that the best choice?

In each topic's distribution, every word has a probability, and all the word probabilities sum to 1.0.
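
A quick sanity check of this, assuming topic_word = model.topic_word_ from the fitted lda model in the blog post:

import numpy as np
# each row of topic_word is one topic's distribution over the vocabulary
print(np.allclose(topic_word.sum(axis=1), 1.0))  # expected: True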

I wrote this code to print out all words whose probability is above an epsilon threshold:

import numpy as np

eps = 0.01
for i, topic_dist in enumerate(topic_word):
    wordindex = np.argsort(topic_dist)[::-1]  # word indices, highest probability first
    w = topic_dist[wordindex]                 # sorted probabilities (length = vocabulary size, 4258 here)
    # only consider the top n_top_words candidates, and keep those above eps
    words = [vocab[wordindex[j]] for j in range(min(n_top_words, len(wordindex))) if w[j] > eps]
    weights = ['{:.3f}'.format(w[j]) for j in range(min(n_top_words, len(wordindex))) if w[j] > eps]
    print('Topic {}: {}; {}'.format(i, ', '.join(words), ', '.join(weights)))

Looking at another library, gensim's LdaModel, it appears that the underlying LDA parameters do not originally sum to 1.0 like that; they are normalized into a probability distribution when a topic is shown, see below:

def show_topic(self, topicid, topn=10):
    topic = self.state.get_lambda()[topicid]
    topic = topic / topic.sum() # normalize to probability dist
    ...

Running the sample code from Latent Dirichlet Allocation (LDA) with Python and calling get_lambda, one can see that the lambda values are sometimes above 1.0.

ldamodel.state.get_lambda()[1]

gives:

array([ 1.48214337,  1.48168697,  0.50442377,  0.50399559,  0.50400832,
    0.5047193 ,  0.50375875,  0.50376053,  1.50224118,  0.50376574,
    0.5037527 ,  0.50377459,  0.50376621,  1.49831418,  1.49832577,
    1.49831855,  1.49831883,  1.49831596,  1.51053093,  3.49684196,
    1.49832204,  1.49832512,  0.50316907,  0.50321838,  0.50328253,
    0.50319543,  0.50317986,  0.50318815,  0.50314213,  0.5031702 ,
    1.49635267,  1.49634655])
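
To recover a proper distribution, a row of lambda can be normalized the same way show_topic does; a minimal sketch using the ldamodel from the sample code above:

lam = ldamodel.state.get_lambda()   # unnormalized, shape (num_topics, vocab_size)
topic = lam[1] / lam[1].sum()       # normalize one topic row to a probability distribution
print(topic.sum())                  # ~1.0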

What is the best eps to choose? Is it better not to normalize the probability distribution and apply the cut-off to the original values? Is it best to take the maximum probability in each topic and base a cut-off on that?

In my actual datasets, an eps of 0.01 sometimes produces a topic with no words at all!
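
For what it's worth, here is a rough sketch of the max-based cut-off from my last question above (rel_eps is just an arbitrary fraction I made up, not a standard parameter):

rel_eps = 0.5   # keep words within a fraction of the topic's own maximum probability
for i, topic_dist in enumerate(topic_word):
    cutoff = rel_eps * topic_dist.max()
    wordindex = np.argsort(topic_dist)[::-1]
    words = [vocab[j] for j in wordindex if topic_dist[j] > cutoff]
    print('Topic {}: {}'.format(i, ', '.join(words)))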

Update

Playing with different numbers of topics, I noticed that with 2 topics on the load_reuters data and eps=0.01, I get this:

Topic 0: ; 
Topic 1: pope; 0.013

I believe Topic 0 can be interpreted as "everything else", or it is a sign that more topics are needed.

import lda
import matplotlib.pyplot as plt

arr = []
for n in range(2, 50):
    model = lda.LDA(n_topics=n, n_iter=20, random_state=1)
    model.fit(X)
    topic_word = model.topic_word_  # model.components_ also works
    # smallest and largest per-topic maximum word probability for this n
    arr.append(
        (min([max(topic_word[i]) for i in range(model.n_topics)]),
         max([max(topic_word[i]) for i in range(model.n_topics)])))
plt.plot(arr)
...

Looking at this chart, n is too low below 5, and the curve flattens out somewhere after 20…


Best Answer

It is important to remember that topic models such as LDA were primarily developed for unsupervised text summarization, so often there is not a "best" choice for how many top words to show. Most research papers on topic models tend to use the top 5-20 words; if you use more than 20 words, you start to defeat the purpose of succinctly summarizing the text.
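
With the lda package from your question, showing a fixed top-N per topic is just an argsort over each topic row; a small sketch, where n_top_words = 10 is an arbitrary choice in the 5-20 range:

import numpy as np

n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    top = np.argsort(topic_dist)[::-1][:n_top_words]  # indices of the N most probable words
    print('Topic {}: {}'.format(i, ' '.join(np.array(vocab)[top])))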

A threshold of $\epsilon = 0.01$ is far too low for showing which words pertain to each topic. A primary purpose of LDA is to group words such that the topic words in each topic are highly probable within that topic. If such a low threshold is chosen, then many, many words will appear in each topic, again defeating the purpose of succinct text summarization. To extract the most probable words, you would be better off choosing a threshold of $\epsilon = 0.9$ or maybe $\epsilon = 0.8$.

The issue of seeing word-less topics when using Gensim is probably because Gensim has its own tolerance parameter, minimum_probability. This parameter defaults to 0.01 (this is explained in the Gensim LDA documentation). If you want to see all the words per topic, regardless of how low their probability of appearing in the topic is, you can set minimum_probability = 0.
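
In Gensim this is just a model argument; a sketch, assuming a corpus and dictionary already built as in the linked sample code:

from gensim.models.ldamodel import LdaModel

# minimum_probability=0 turns off the default 0.01 filtering of low-probability entries
ldamodel = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                    passes=20, minimum_probability=0)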

For LDA, you are best off using the normalized probabilities (via the get_topic_terms function on the ldamodel), because they are the most interpretable. I am not intimately familiar with how Gensim estimates the topic-word probabilities, but the unnormalized values are probably a by-product of Bayesian estimation, where directly estimating the denominator is not relevant because (as you said) it is just a normalization.
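
For example, get_topic_terms returns (word id, normalized probability) pairs, which you can map back to words through the model's dictionary:

# probabilities here are normalized to sum to 1 over the vocabulary
for word_id, prob in ldamodel.get_topic_terms(1, topn=10):
    print(ldamodel.id2word[word_id], round(prob, 3))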
