Let's say I run an LDA model with 3 topics on 5 documents.
After the model is fit (presumably with Gibbs sampling), I have a topic distribution for each document, i.e. a vector of 3 topic proportions per document.
My question is: how do I retrieve the document(s) that are most similar to document 1?
In clustering algorithms such as K-means, each document is assigned to one of the K clusters. To retrieve doc-1's neighbor documents, I just need to find all the documents assigned to the same cluster as doc-1.
What procedure should I follow with an LDA model?
Best Answer
I found the metrics below to be quite useful. Each one compares two probability distributions: take the topic distribution of document 1 as $P$ and the topic distribution of document $i$ as $Q$ (the $P$ and $Q$ used in the Wikipedia definitions). Repeat this for all documents (iterate through $i$, replacing $Q$ each time) to create a list of distances. Then rank the distances from smallest to largest: the document with the smallest distance is the most similar to doc-1.
Jensen-Shannon Distance
Hellinger Distance
Bhattacharyya Distance
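For reference, here are the standard definitions for two discrete distributions $P = (p_1, \dots, p_K)$ and $Q = (q_1, \dots, q_K)$ over $K$ topics. The symbols $p_k$, $q_k$ and $M$ are notation introduced here for illustration, and the natural logarithm is assumed (a different base only rescales the Jensen-Shannon and Bhattacharyya values).

$$M = \tfrac{1}{2}(P + Q), \qquad D_{KL}(P \,\|\, M) = \sum_{k=1}^{K} p_k \ln \frac{p_k}{m_k}$$

$$d_{JS}(P, Q) = \sqrt{\tfrac{1}{2} D_{KL}(P \,\|\, M) + \tfrac{1}{2} D_{KL}(Q \,\|\, M)}$$

$$d_{H}(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{k=1}^{K} \left(\sqrt{p_k} - \sqrt{q_k}\right)^2}$$

$$d_{B}(P, Q) = -\ln \sum_{k=1}^{K} \sqrt{p_k \, q_k}$$

All three are zero when $P = Q$ and grow as the two topic distributions diverge.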
Different metrics may return different rankings; it is up to you to decide which one best suits your needs.
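As a rough illustration of the ranking procedure, here is a minimal Python sketch using NumPy. The document-topic matrix `theta` is made-up example data (not the values from the question), and the function names `jensen_shannon_distance`, `hellinger_distance`, `bhattacharyya_distance` and `rank_by_similarity` are hypothetical helpers written for this example, not part of any LDA library. Note that doc-1 is row 0 here because Python indexes from zero.

```python
import numpy as np


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute 0 by convention, so skip them.
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by rounding.
    return np.sqrt(max(jsd, 0.0))


def hellinger_distance(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0)


def bhattacharyya_distance(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # Bhattacharyya coefficient; the distance is its negative log.
    bc = np.sum(np.sqrt(p * q))
    return -np.log(bc)


def rank_by_similarity(theta, query_index, distance):
    """Return (doc_index, distance) pairs for all other documents,
    sorted from most similar (smallest distance) to least similar."""
    theta = np.asarray(theta, dtype=float)
    pairs = [
        (i, distance(theta[query_index], theta[i]))
        for i in range(len(theta))
        if i != query_index
    ]
    return sorted(pairs, key=lambda pair: pair[1])


# Made-up document-topic proportions: 5 documents (rows) x 3 topics (columns).
theta = np.array([
    [0.70, 0.20, 0.10],  # doc-1 (row 0)
    [0.10, 0.80, 0.10],
    [0.65, 0.25, 0.10],
    [0.20, 0.20, 0.60],
    [0.30, 0.40, 0.30],
])

for metric in (jensen_shannon_distance, hellinger_distance, bhattacharyya_distance):
    ranking = rank_by_similarity(theta, query_index=0, distance=metric)
    print(metric.__name__, [(i, round(float(d), 4)) for i, d in ranking])
```

The first entry of each ranking is the nearest neighbor of doc-1 under that metric. In practice `theta` would come from your fitted model's per-document topic proportions, and you can keep the top few entries instead of only the first.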