Solved – LDA: find percentage / number of documents per topic

latent-dirichlet-alloc, python, topic-models

I'm using LDA to find topics in a corpus. Everything works fine (I have the topics), but I would like to get the percentage / number of documents in each topic. Is that possible? I looked at scikit-learn and gensim but could not get anything working.

Best Answer

The technically correct, but completely useless answers to your questions:

Number of documents in each topic: the number of documents in your corpus. Percentage of documents in each topic: 100%

Why is that?

Among other things, LDA represents every document as a vector of topic probabilities. Each document gets a probability of belonging to each of the topics, and each topic in turn comprises all documents, each with a topic-specific weight. Even if the probability of a document belonging to a topic is practically zero, an entry is still generated for it (in practice, such a probability is never exactly zero, just infinitesimally small). Similarly, a topic comprises all documents, even a document whose weight is 0.0000001.
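If you still want a count or percentage per topic, you have to decide how to assign documents to topics. A minimal sketch of two common choices follows, assuming you already have a document-topic matrix. The matrix below is made up for illustration; in gensim you could build it from get_document_topics(), in scikit-learn from the output of LatentDirichletAllocation.transform().

```python
from collections import Counter

# Hypothetical document-topic matrix: one row per document, one column per
# topic, each row summing to 1. The numbers are invented for illustration.
doc_topic = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
]
num_docs = len(doc_topic)
num_topics = len(doc_topic[0])

# "Hard" assignment: each document counts only for its most probable topic.
dominant = [max(range(num_topics), key=row.__getitem__) for row in doc_topic]
counts = Counter(dominant)
hard_share = [100.0 * counts[t] / num_docs for t in range(num_topics)]

# "Soft" assignment: average each topic's probability over all documents.
soft_share = [
    100.0 * sum(row[t] for row in doc_topic) / num_docs
    for t in range(num_topics)
]

for t in range(num_topics):
    print(f"Topic {t}: {counts[t]} docs, "
          f"{hard_share[t]:.1f}% hard, {soft_share[t]:.2f}% soft")
```

The hard assignment throws away information for documents that mix several topics, while the soft one always sums to 100% across topics. Which one is appropriate depends on what you want to report.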

What I think you want to see

For gensim (I don't know the scikit-learn code by heart): have you tried gensim's built-in functions, such as show_topics() (a broad overview of selected topics), print_topic() (a single topic as a string) or get_topic_terms() (word ids and probabilities for a topic)?

They usually return the top 10 most important topics or words by default, but you can of course play with the settings. You can check the documentation of gensim.models.ldamodel here.

In practice, you might want to use something like this

# topic_id instead of id, so the built-in id() is not shadowed
for topic_id, topic in lda_model.show_topics(num_topics=20, num_words=10):
    print('Topic: {} \nWords: {}'.format(topic_id, topic))

to get something like

Topic: 0
Words: 0.013*"change" + 0.012*"company" + 0.011*"dance"...

which you can reformat to your liking.
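As one possible way of reformatting, here is a small sketch that parses a topic string in the format shown above into (word, weight) pairs. The helper name parse_topic is hypothetical, and the regular expression assumes weights are printed as plain decimals. If I remember correctly, show_topics() also accepts formatted=False, which returns the pairs directly and makes the parsing unnecessary.

```python
import re

# Example string in the format shown above: weight*"word" joined by " + ".
topic = '0.013*"change" + 0.012*"company" + 0.011*"dance"'


def parse_topic(topic_string):
    """Return a list of (word, weight) pairs from a formatted topic string."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


terms = parse_topic(topic)
print(", ".join(f"{word} ({weight:.3f})" for word, weight in terms))
```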

Some more inspiration can be found here.