How to find “similar documents” after a Latent Dirichlet Allocation model is built

information retrieval, latent-dirichlet-alloc, text mining

Let's say I run an LDA model with 3 topics on 5 documents.

After the model is learned (presumably with Gibbs sampling), I have a topic distribution for each document, as shown below:

[Image: topic distributions over the 3 topics for each of the 5 documents]

My question is: how do I retrieve the document(s) that are "most similar to document-1"?

In clustering algorithms such as K-means, each document is assigned to one of the K clusters. To retrieve doc-1's neighbor documents, I just need to find all the documents that are assigned to the same cluster as doc-1.

What procedure should I use with an LDA model?

Best Answer

I found distance metrics for probability distributions to be quite useful for this. Use such a metric to compare the topic distribution of document 1 (the $P$ distribution, in the notation used on Wikipedia) to that of document $i$ (the $Q$ distribution). Repeat this for every other document (iterate over $i$, replacing $Q$ each time) to build a list of distances. Then rank the distances from smallest to largest: the document with the smallest distance is the most similar to doc-1.

Different metrics may produce different rankings; it is up to you to decide which one best suits your needs.
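
The ranking procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answer does not name specific metrics in the text that survives here, so the Hellinger distance and the Jensen-Shannon divergence are my own choice of two common distances between discrete distributions. The `doc_topics` matrix is made-up data (5 documents, 3 topics), and the helper names `hellinger`, `jensen_shannon`, and `rank_similar` are hypothetical. Note that doc-1 is row index 0.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0 to 1)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence using the natural log (range 0 to ln 2)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0 by convention
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_similar(doc_topics, query_idx, metric):
    """Return (index, distance) pairs for all other documents, nearest first."""
    p = doc_topics[query_idx]
    dists = [
        (i, float(metric(p, q)))
        for i, q in enumerate(doc_topics)
        if i != query_idx
    ]
    return sorted(dists, key=lambda pair: pair[1])

# Assumed example data: 5 documents x 3 topics, each row sums to 1.
doc_topics = np.array([
    [0.80, 0.10, 0.10],  # doc-1 (the query, P)
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.40, 0.30, 0.30],
])

hellinger_ranking = rank_similar(doc_topics, 0, hellinger)
js_ranking = rank_similar(doc_topics, 0, jensen_shannon)

for idx, dist in hellinger_ranking:
    print(f"doc-{idx + 1}: Hellinger distance {dist:.4f}")
```

With this toy data both metrics place the third document (index 2) closest to doc-1, but on real topic distributions the two rankings need not agree, which is the point made above about choosing a metric.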