Solved – Comparing topic distributions between corpora using Latent Dirichlet Allocation and R topicmodels or python gensim

dirichlet-process, machine learning, text mining, topic-models

I am working on a problem where I want to extract a set of LDA topics from one corpus and then compare the distribution of those topics in other corpora. In other words, I want to lock in the topics and then get a sense of how similar or different another corpus is from the original one. Can someone suggest a tool or approach for this type of comparison?

My particular application is comparing local versus national newspapers. I have a corpus of national newspaper articles and have already used gensim to extract the topics. I also have corpora of local newspapers collected during the same period. I want to compare the distribution of the same topics in the national newspapers versus the local newspapers. I would also like to look at the structure of each topic in the national and local corpora (for example, how the probability of co-occurrence of words for the same topic changes between the two corpora).

I looked around in the R topicmodels and Python gensim packages, but had no luck. Any suggestions?

Best Answer

You have to train your model on one corpus, get the topic distributions for both of the corpora you want to compare, and then choose a metric to compare them. For example, the topic distributions are vectors, so you can use the Euclidean distance between them as an indicator of the difference between the corpora.

EDIT - (example)

With gensim, you'll have to do something like this:

import numpy
from gensim.matutils import sparse2full
from gensim.models import LdaModel

# Train the LDA model on the national corpus only.
# regional_corpus must be encoded with the same dictionary as national_corpus.
lda = LdaModel(national_corpus, num_topics=10)

def average_topic_distribution(corpus):
    # lda[doc] drops low-probability topics, so request all of them and
    # expand each sparse result into a dense vector of length num_topics
    topic_vectors = [
        sparse2full(lda.get_document_topics(doc, minimum_probability=0), lda.num_topics)
        for doc in corpus
    ]
    return numpy.average(numpy.array(topic_vectors), axis=0)

# Get the mean of all topic distributions in each corpus
national_average = average_topic_distribution(national_corpus)
regional_average = average_topic_distribution(regional_corpus)

# Calculate the distance between the distribution of topics in both corpora
difference_of_distributions = numpy.linalg.norm(national_average - regional_average)
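
Euclidean distance is only one possible metric. Since the averaged topic vectors are probability distributions, a distance designed for distributions may be easier to interpret. The following is a minimal sketch in plain Python of two common choices, the Hellinger distance and the Jensen-Shannon divergence. The function names and the three-topic example vectors are made up for illustration and are not part of gensim.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero numerator contribute nothing
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(a + b) / 2.0 for a, b in zip(p, q)]  # mixture of the two distributions
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical average topic vectors for two corpora (three topics each)
national_average = [0.5, 0.3, 0.2]
regional_average = [0.2, 0.3, 0.5]

print(hellinger(national_average, regional_average))
print(jensen_shannon(national_average, regional_average))
```

Both measures are bounded between 0 and 1, which makes values from different corpus pairs directly comparable.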

Related Solutions

Text Mining – Topic Prediction Using Latent Dirichlet Allocation

I'd try 'folding in'. This refers to taking one new document, adding it to the corpus, and then running Gibbs sampling only on the words in that new document, keeping the topic assignments of the old documents fixed. This usually converges fast (often within 5 to 20 iterations), and because you don't need to resample your old corpus, each iteration is also cheap. At the end you will have a topic assignment for every word in the new document, which gives you the distribution of topics in that document.

In your Gibbs sampler, you probably have something similar to the following code:

// This will initialize the matrices of counts, N_tw (topic-word matrix) and N_dt (document-topic matrix)
for doc = 1 to N_Documents
    for token = 1 to N_Tokens_In_Document
       Assign current token to a random topic, updating the count matrices
    end
end

// This will do the Gibbs sampling
for doc = 1 to N_Documents
    for token = 1 to N_Tokens_In_Document
       Compute probability of current token being assigned to each topic
       Sample a topic from this distribution
       Assign the token to the new topic, updating the count matrices
    end
end

Folding-in is the same, except you start with the existing matrices, add the new document's tokens to them, and do the sampling for only the new tokens. I.e.:

Start with the N_tw and N_dt matrices from the previous step

// This will update the count matrices for folding-in
for token = 1 to N_Tokens_In_New_Document
   Assign current token to a random topic, updating the count matrices
end

// This will do the folding-in by Gibbs sampling
for token = 1 to N_Tokens_In_New_Document
   Compute probability of current token being assigned to each topic
   Sample a topic from this distribution
   Assign the token to the new topic, updating the count matrices
end
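
The folding-in pseudocode above could be sketched in Python roughly as follows. This is an illustration under assumptions, not a reference implementation: it assumes a collapsed Gibbs sampler with symmetric hyperparameters `alpha` and `beta`, stores the topic-word counts as a list of lists `n_tw[topic][word]`, and keeps a hypothetical helper vector `n_t` of per-topic totals. Only the new document's row of N_dt is needed, so it is kept as a local list `n_dt`.

```python
import random

def fold_in(new_doc, n_tw, n_t, alpha, beta, n_iter=20, seed=0):
    """Estimate the topic distribution of one new document by folding in.

    new_doc : list of word ids
    n_tw    : n_tw[t][w] = count of word w assigned to topic t in the trained model
    n_t     : n_t[t] = total number of tokens assigned to topic t
    """
    rng = random.Random(seed)
    n_topics, vocab_size = len(n_tw), len(n_tw[0])
    # Work on copies so the trained counts are left untouched
    n_tw = [row[:] for row in n_tw]
    n_t = n_t[:]
    n_dt = [0] * n_topics  # topic counts for the new document only

    # Assign each new token to a random topic, updating the count matrices
    z = []
    for w in new_doc:
        t = rng.randrange(n_topics)
        z.append(t)
        n_tw[t][w] += 1
        n_t[t] += 1
        n_dt[t] += 1

    # Gibbs sweeps over the new document's tokens only
    for _ in range(n_iter):
        for j, w in enumerate(new_doc):
            # Remove the token's current assignment from the counts
            t = z[j]
            n_tw[t][w] -= 1
            n_t[t] -= 1
            n_dt[t] -= 1
            # Unnormalised probability of each topic for this token
            weights = [
                (n_dt[k] + alpha) * (n_tw[k][w] + beta) / (n_t[k] + vocab_size * beta)
                for k in range(n_topics)
            ]
            # Sample a new topic and add the token back under it
            t = rng.choices(range(n_topics), weights=weights)[0]
            z[j] = t
            n_tw[t][w] += 1
            n_t[t] += 1
            n_dt[t] += 1

    # Smoothed topic proportions for the new document
    total = len(new_doc) + n_topics * alpha
    return [(n_dt[k] + alpha) / total for k in range(n_topics)]

# Toy trained model: topic 0 uses words 0 and 1, topic 1 uses words 2 and 3
trained_n_tw = [[50, 50, 0, 0], [0, 0, 50, 50]]
trained_n_t = [100, 100]
theta = fold_in([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], trained_n_tw, trained_n_t,
                alpha=1.0, beta=0.01, n_iter=50)
print(theta)
```

The function copies the count matrices before sampling, so the same trained model can be reused to fold in many documents independently.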

With standard LDA, it is unlikely that an entire document was generated by a single topic, so I am not sure how useful it is to compute the probability of the document under one topic. If you still want to do it, it's easy. From the two matrices you get, you can compute $p^i_w$, the probability of word $w$ in topic $i$. Take your new document and suppose the $j$'th word is $w_j$. The words are independent given the topic, so the probability of the document under topic $i$ is just $$\prod_j p^i_{w_j}$$ (note that you will probably need to compute it in log space to avoid numerical underflow).
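
As a rough sketch of the log-space computation, the product becomes a sum of logarithms. The snippet below assumes that $p^i_w$ is estimated from the topic's row of word counts with a symmetric smoothing parameter `beta`. That estimator and the function name are assumptions made for illustration.

```python
import math

def log_prob_under_topic(doc, topic_word_counts, beta):
    """Log-probability of a document's words, assuming all come from one topic.

    doc               : list of word ids
    topic_word_counts : counts of each word assigned to the topic (one row of N_tw)
    beta              : symmetric smoothing parameter
    """
    vocab_size = len(topic_word_counts)
    denom = sum(topic_word_counts) + vocab_size * beta
    # Sum of log p^i_{w_j} instead of the product of p^i_{w_j}
    return sum(math.log((topic_word_counts[w] + beta) / denom) for w in doc)

# Toy example: a topic that uses two words equally often
log_p = log_prob_under_topic([0, 1, 0], [1, 1], beta=0.0)
print(log_p)
```

Comparing these log-probabilities across topics is equivalent to comparing the products, without the underflow that long documents would cause.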

Machine Learning – Topic Stability in Topic Models

For my own curiosity, I applied a clustering algorithm that I've been working on to this dataset.

I've temporarily put-up the results here (choose the essays dataset).

It seems like the problem is not the starting points or the algorithm, but the data. You can 'reasonably' (subjectively, in my limited experience) get good clusters even with 147 instances, as long as there are some hidden topics/concepts/themes/clusters (whatever you would like to call them).

If the data does not have well-separated topics, then whichever algorithm you use, you might not get good answers.
