Solved – How to find “similar documents” after a Latent Dirichlet Allocation model is built

information retrieval, latent-dirichlet-alloc, text mining

Let's say I run an LDA model with 3 topics on 5 documents.

After the model is learned (presumably with Gibbs sampling), I have a topic distribution for each document, i.e. a vector of 3 topic proportions for each of the 5 documents.

My question is: how do I retrieve the document(s) that are "most similar" to document 1?

In clustering algorithms such as K-means, each document is assigned to one of the K clusters. To retrieve doc-1's neighbor documents, I just need to find all the documents that are assigned to the same cluster as doc-1.

What is the corresponding procedure with an LDA model?

Best Answer

I have found distance metrics between probability distributions to be quite useful for this. Use such a metric to compare the topic distribution of document 1 (the $P$ distribution, in Wikipedia's notation) to that of document $i$ (the $Q$ distribution). Repeat this for all documents (iterate through $i$, replacing $Q$ each time) to create a list of distances. Then rank the distances from smallest to largest: the document with the smallest distance is the most similar to doc-1.

Different metrics may return different rankings; which one works best for your needs is up to you.
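The ranking procedure above can be sketched in Python as follows. The specific metrics named in the original answer did not survive, so the Hellinger distance and Jensen-Shannon divergence used here are assumptions, chosen only because they are common choices for comparing discrete distributions. The `theta` values are hypothetical, and `rank_similar` is an illustrative helper, not part of any library.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (symmetric, finite even when entries are 0)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_similar(theta, query, metric):
    """Return (doc_index, distance) pairs for all other docs, closest first."""
    dists = [(i, float(metric(theta[query], theta[i])))
             for i in range(len(theta)) if i != query]
    return sorted(dists, key=lambda pair: pair[1])

# Hypothetical document-topic distributions: 5 documents, 3 topics.
theta = np.array([
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.10, 0.80],
    [0.30, 0.40, 0.30],
    [0.05, 0.90, 0.05],
])

for doc, dist in rank_similar(theta, query=0, metric=hellinger):
    print(f"doc index {doc}: distance {dist:.3f}")
```

Swapping `metric=hellinger` for `metric=jensen_shannon` reruns the same ranking under a different notion of distance, which is a quick way to see how much the choice of metric matters for your corpus.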

Related Solutions

Text Mining – Topic Prediction Using Latent Dirichlet Allocation

I'd try 'folding in'. This refers to taking one new document, adding it to the corpus, and then running Gibbs sampling just on the words in that new document, keeping the topic assignments of the old documents fixed. This usually converges fast (typically within 5 to 20 iterations), and because you don't need to resample your old corpus, it also runs fast. At the end you will have a topic assignment for every word in the new document, which gives you the distribution of topics in that document.

In your Gibbs sampler, you probably have something similar to the following code:

// This will initialize the matrices of counts, N_tw (topic-word matrix) and N_dt (document-topic matrix)
for doc = 1 to N_Documents
    for token = 1 to N_Tokens_In_Document
       Assign current token to a random topic, updating the count matrices
    end
end

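// This will do the Gibbs sampling (repeat the sweep over the corpus for several iterations)
for iter = 1 to N_Iterations
    for doc = 1 to N_Documents
        for token = 1 to N_Tokens_In_Document
           Remove the token's current assignment from the count matrices
           Compute probability of current token being assigned to each topic
           Sample a topic from this distribution
           Assign the token to the new topic, updating the count matrices
        end
    end
end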

Folding-in is the same, except you start with the existing matrices, add the new document's tokens to them, and do the sampling for only the new tokens. I.e.:

Start with the N_tw and N_dt matrices from the previous step

// This will update the count matrices for folding-in
for token = 1 to N_Tokens_In_New_Document
   Assign current token to a random topic, updating the count matrices
end

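// This will do the folding-in by Gibbs sampling (a few iterations are usually enough)
for iter = 1 to N_Iterations
    for token = 1 to N_Tokens_In_New_Document
       Remove the token's current assignment from the count matrices
       Compute probability of current token being assigned to each topic
       Sample a topic from this distribution
       Assign the token to the new topic, updating the count matrices
    end
end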
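As a rough Python rendering of the folding-in loop above, here is a minimal sketch. It assumes the standard collapsed Gibbs update with symmetric Dirichlet hyperparameters `alpha` and `beta` (the answer does not specify either), and it assumes `n_tw` has shape (topics, vocabulary). The function name `fold_in` and its parameters are hypothetical.

```python
import numpy as np

def fold_in(new_doc, n_tw, alpha=0.1, beta=0.01, n_iters=20, seed=0):
    """Estimate the topic distribution of one new document by folding-in.

    new_doc : sequence of word ids (ints in [0, vocab_size))
    n_tw    : (n_topics, vocab_size) topic-word counts from the trained model
    Returns (theta, z): topic proportions and per-token topic assignments.
    """
    rng = np.random.default_rng(seed)
    n_tw = np.array(n_tw, dtype=float)   # work on a copy of the trained counts
    n_topics, vocab_size = n_tw.shape
    n_t = n_tw.sum(axis=1)               # total tokens assigned to each topic
    n_dt = np.zeros(n_topics)            # the new document's row of N_dt

    # Assign each new token to a random topic, updating the counts.
    z = rng.integers(n_topics, size=len(new_doc))
    for w, t in zip(new_doc, z):
        n_tw[t, w] += 1
        n_t[t] += 1
        n_dt[t] += 1

    # Resample only the new document's tokens; old assignments stay fixed.
    for _ in range(n_iters):
        for j, w in enumerate(new_doc):
            t = z[j]
            n_tw[t, w] -= 1              # remove the token's current assignment
            n_t[t] -= 1
            n_dt[t] -= 1
            # Assumed collapsed Gibbs conditional: p(w | topic) * p(topic | doc)
            p = (n_tw[:, w] + beta) / (n_t + vocab_size * beta) * (n_dt + alpha)
            p /= p.sum()
            t = rng.choice(n_topics, p=p)
            z[j] = t
            n_tw[t, w] += 1              # record the new assignment
            n_t[t] += 1
            n_dt[t] += 1

    theta = (n_dt + alpha) / (n_dt.sum() + n_topics * alpha)
    return theta, z
```

The function copies `n_tw` rather than updating it in place, so the trained model is left untouched and several new documents can be folded in independently. Whether the new document's counts should instead be kept in the model is a design choice this sketch does not settle.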

If you do standard LDA, it is unlikely that an entire document was generated by one topic, so I don't know how useful it is to compute the probability of the document under a single topic. But if you still want to do it, it's easy. From the count matrices you can compute $p^i_w$, the probability of word $w$ in topic $i$. Take your new document and suppose its $j$'th word is $w_j$. The words are independent given the topic, so the probability of the document under topic $i$ is just $$\prod_j p^i_{w_j}$$ (note that you will probably need to compute it in log space to avoid numerical underflow).
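The log-space computation mentioned above could look like the following minimal sketch. It estimates $p^i_w$ from the topic-word counts alone, with an additive smoothing term `beta` that is an assumption here (it keeps unseen words from producing a log of zero). The function name `log_prob_under_topic` is hypothetical.

```python
import numpy as np

def log_prob_under_topic(doc, n_tw, topic, beta=0.01):
    """Log of prod_j p^i_{w_j}: the document's probability under one topic.

    doc   : sequence of word ids
    n_tw  : (n_topics, vocab_size) topic-word count matrix
    topic : index i of the topic to evaluate
    """
    n_tw = np.asarray(n_tw, dtype=float)
    vocab_size = n_tw.shape[1]
    # Smoothed estimate of p^i_w for every word w in the vocabulary.
    p_w = (n_tw[topic] + beta) / (n_tw[topic].sum() + vocab_size * beta)
    # Summing logs replaces the product, which would underflow for long docs.
    return float(np.sum(np.log(p_w[np.asarray(doc, dtype=int)])))
```

Comparing the returned value across topics (higher is more probable) avoids ever forming the raw product.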

Latent Dirichlet Allocation – How to Calculate Perplexity of a Holdout?

This is indeed something often glossed over.

Some people do something a bit cheeky: they hold out a proportion of the words in each document and evaluate using the predictive probabilities of these held-out words, given the document-topic mixtures as well as the topic-word mixtures. This is obviously not ideal, as it doesn't evaluate performance on any held-out documents.
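The held-out-words evaluation just described can be sketched as below. It assumes you already have point estimates `theta` (document-topic mixtures, fitted on the observed part of each document) and `phi` (topic-word mixtures), and it uses the usual definition of perplexity as the exponential of the negative mean log-likelihood per held-out token. The function name `heldout_perplexity` is hypothetical.

```python
import numpy as np

def heldout_perplexity(theta, phi, heldout):
    """Perplexity of held-out words under fitted LDA point estimates.

    theta   : (n_docs, n_topics) document-topic mixtures
    phi     : (n_topics, vocab_size) topic-word mixtures
    heldout : list with one sequence of held-out word ids per document
    """
    # p(w | d) = sum_k theta[d, k] * phi[k, w], for every document and word.
    word_probs = np.asarray(theta) @ np.asarray(phi)
    log_lik = 0.0
    n_tokens = 0
    for d, words in enumerate(heldout):
        idx = np.asarray(words, dtype=int)
        log_lik += np.sum(np.log(word_probs[d, idx]))
        n_tokens += len(idx)
    return float(np.exp(-log_lik / n_tokens))
```

A lower value means the model predicts the held-out words better. A model that spreads probability uniformly over a vocabulary of size $V$ scores exactly $V$, which is a handy sanity check.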

To do it properly with held-out documents, as suggested, you do need to "integrate over the Dirichlet prior for all possible topic mixtures". http://people.cs.umass.edu/~wallach/talks/evaluation.pdf reviews a few methods for tackling this slightly unpleasant integral. I'm just about to try and implement this myself in fact, so good luck!
