Using topic words generated by LDA to represent a document

feature-selection, latent-dirichlet-allocation, text-mining, topic-models

I want to do document classification by representing each document as a set of features. I know there are many feature representations: bag-of-words (BOW), TF-IDF, …

I want to use Latent Dirichlet Allocation (LDA) to extract the topic keywords of EACH SINGLE document, so that each document is represented by its own topic words. But I do not know whether this is reasonable, because in my opinion LDA is usually used to extract the topic words shared by A BUNCH OF documents.

Can LDA be used to detect the topic of A SINGLE document?

Best Answer

Can LDA be used to detect the topic of A SINGLE document?

Yes, in its particular representation of 'topic,' and given a training corpus of (usually related) documents.

LDA represents topics as distributions over words, and documents as distributions over topics. That is, the very purpose of LDA is to arrive at a probabilistic representation of each document as a set of topics. For example, the LDA implementation in gensim can return this representation for any given document.
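As a minimal sketch of what that looks like in gensim (the toy corpus, topic count, and hyperparameters here are my own illustrative choices, not anything from the question):

```python
from gensim import corpora, models

# toy corpus: each document is a list of tokens
texts = [
    ["ball", "game", "score", "team"],
    ["match", "team", "goal", "ball"],
    ["cpu", "memory", "code", "compile"],
    ["code", "bug", "compile", "memory"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# fit a 2-topic model on the whole corpus
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# per-document topic distribution, for a seen or unseen document
new_doc = dictionary.doc2bow(["team", "score", "goal"])
print(lda.get_document_topics(new_doc))  # e.g. [(0, 0.91), (1, 0.09)]
```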

But this depends on the other documents in the corpus: Any given document will have a different representation if analyzed as part of a different corpus.

That's not typically considered a shortcoming: Most applications of LDA focus on related documents. The paper introducing LDA applies it to two corpora, one of Associated Press articles and one of scientific article abstracts. Edwin Chen's nicely approachable blog post applies LDA to a tranche of emails from Sarah Palin's time as Alaska governor.

If your application demands separating documents into known, mutually exclusive classes, then LDA-derived topics can be used as features for classification. Indeed, the original paper does exactly that with the AP corpus, with good results.
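Continuing the sketch above, here is one hedged way to turn those per-document topic distributions into classifier features with scikit-learn (the labels are entirely hypothetical, just to make the example run):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def topic_vector(bow):
    # dense topic-probability vector; minimum_probability=0.0 keeps every topic
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

X = np.vstack([topic_vector(bow) for bow in corpus])
y = np.array([0, 0, 1, 1])  # hypothetical class labels, one per document

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```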

Relatedly, Chen's demonstration doesn't sort documents into exclusive classes, but his documents mostly concentrate their probability mass on single LDA topics. As David Blei explains in this video lecture, the Dirichlet priors can be chosen to favor sparsity. More simply, "a document is penalized for using many topics," as his slides put it. This seems the closest LDA can get to a single, unsupervised topic, but it certainly doesn't guarantee every document will be represented as such.
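In gensim this amounts to setting a small document-topic prior when fitting the model. A sketch, again reusing the toy corpus above (the value 0.01 is an illustrative choice, not a recommendation):

```python
# a small symmetric alpha penalizes documents for spreading probability over
# many topics, pushing each document toward one or a few dominant topics
lda_sparse = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                             alpha=0.01, passes=10)
```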
