Comparing topic distributions between corpora using Latent Dirichlet Allocation and R topicmodels or Python gensim

dirichlet-process, machine learning, text mining, topic-models

I am working on a problem where I want to extract a set of LDA topics from one corpus and then compare the distribution of those topics in other corpora. In other words, I want to lock in the topics and then get a sense of how similar or different another corpus is from the original one. Can someone suggest a tool or approach for this type of comparison?

My particular application is comparing local versus national newspapers. I have a corpus of national newspaper articles and have already used gensim to extract the topics. I also have corpora of local newspapers captured during the same period, and I want to compare the distribution of the same topics in the national versus the local newspapers. I would also like to look at the structure of each topic in the national and local corpora (for example, how the probability of co-occurrence of words for the same topic changes between the two corpora).

I looked around in the R topicmodels and the Python gensim packages, but had no luck. Any suggestions?

Best Answer

Train your model on one corpus, get the topic distributions for both corpora you want to compare, and then choose a metric to compare them. For example, the topic distributions are vectors, so you can use the Euclidean distance between them as an indicator of how different the documents (or, once averaged, the corpora) are.
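To make the "distributions are vectors" point concrete, here is a minimal sketch using only the standard library. The two vectors are made-up values for illustration, not output from any real model.

```python
import math

# Made-up topic distributions over three topics (each sums to 1)
national = [0.5, 0.3, 0.2]
regional = [0.2, 0.3, 0.5]

# Euclidean distance: square root of the summed squared differences
distance = math.dist(national, regional)
print(round(distance, 3))  # 0.424
```

A distance of 0 means the two distributions are identical; for probability vectors the largest possible value is the square root of 2.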

EDIT - (example)

With gensim, you would do something like this:

import numpy
from gensim.models import LdaModel

# Train the LDA model on the national corpus only, so the topics stay fixed
lda = LdaModel(national_corpus, num_topics=10)

def mean_topic_distribution(corpus):
    """Average the per-document topic distributions of a bag-of-words corpus."""
    vectors = []
    for newspaper in corpus:
        # lda[newspaper] omits low-probability topics, so build a dense
        # vector of length num_topics to keep every document comparable
        vector = numpy.zeros(lda.num_topics)
        for topic_id, probability in lda.get_document_topics(newspaper, minimum_probability=0.0):
            vector[topic_id] = probability
        vectors.append(vector)
    return numpy.mean(vectors, axis=0)

# Mean topic distribution in each corpus; regional_corpus must be encoded
# with the same dictionary as national_corpus
national_average = mean_topic_distribution(national_corpus)
regional_average = mean_topic_distribution(regional_corpus)

# Euclidean distance between the two mean topic distributions
difference_of_distributions = numpy.linalg.norm(national_average - regional_average)
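
Euclidean distance is only one possible choice of metric. Since the averaged topic vectors are probability distributions, measures designed for distributions may be easier to interpret. The following is a minimal sketch of two such measures in plain Python; the function names and the toy vectors are assumptions for illustration, standing in for the averages computed above.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero probability contribute 0
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    # Mixture of the two distributions; positive wherever p or q is positive
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2

# Hypothetical stand-ins for the averaged topic vectors of two corpora
national_example = [0.5, 0.3, 0.2]
regional_example = [0.2, 0.3, 0.5]

print(hellinger(national_example, regional_example))
print(jensen_shannon(national_example, regional_example))
```

Both measures are symmetric and bounded between 0 and 1, which makes values comparable across several local corpora. gensim also ships distance helpers in `gensim.matutils` (for example `hellinger`), which may be more convenient when working with its sparse topic vectors directly.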