Solved – Using LDA to calculate similarity

feature-selection, latent-dirichlet-alloc, topic-models

I have a training data set on which I use Latent Dirichlet Allocation (LDA) to generate topics. I would like to apply this model to other documents and see how similar they are to the training data. How should I go about doing this?

I was able to generate topics on the training data set, and Gensim gave me a topic distribution. But how do I apply the trained LDA model to other documents to calculate similarity?

Best Answer

(Pseudo-code) Computing the similarity between two documents (doc1, doc2) using an existing LDA model:

  1. lda_vec1, lda_vec2 = lda(doc1), lda(doc2)
  2. score = similarity(lda_vec1, lda_vec2)

In the first step, you apply your LDA model to each of the two input documents (converted to bag-of-words with the same dictionary that was used for training) and get back a vector for each. The vector represents the document's topic distribution.

The second step is to apply a measure of your choice to compare the two vectors. You should experiment with different measures to see which one works best in your case. Good options to consider are cosine similarity and Hellinger distance; keep in mind that a similarity is higher for closer documents, while a distance is lower. Note that the underlying assumption here is that we consider two documents to be similar if their inferred topic distributions are similar.
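To make the two measures concrete, here is a small pure-Python sketch. The function names and the example vectors are made up for illustration; the vectors use the sparse (topic_id, probability) format from above and are converted to dense lists first.

```python
import math


def to_dense(sparse_vec, num_topics):
    """Expand a sparse [(topic_id, prob), ...] vector into a dense list."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vec:
        dense[topic_id] = prob
    return dense


def cosine_similarity(p, q):
    """Cosine of the angle between p and q; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def hellinger_distance(p, q):
    """Hellinger distance between two probability distributions, in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


# Illustrative topic distributions over 3 topics.
vec_a = to_dense([(0, 0.9), (1, 0.1)], num_topics=3)
vec_b = to_dense([(0, 0.8), (2, 0.2)], num_topics=3)
vec_c = to_dense([(2, 1.0)], num_topics=3)

print(cosine_similarity(vec_a, vec_b), hellinger_distance(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c), hellinger_distance(vec_a, vec_c))
```

Here vec_a and vec_b share most of their mass on topic 0, so they score as close under both measures, while vec_a and vec_c share no topics at all. Gensim also provides gensim.matutils.hellinger, which works on the sparse vectors directly.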

Example using cosine similarity (cossim expects the sparse (topic_id, probability) vectors that gensim returns):

import gensim
similarity = gensim.matutils.cossim(lda_vec1, lda_vec2)