Using topic words generated by LDA to represent a document

feature-selection, latent-dirichlet-allocation, text-mining, topic-models

I want to do document classification by representing each document as a set of features. I know there are many feature representations: bag-of-words (BOW), TF-IDF, …

I want to use Latent Dirichlet Allocation (LDA) to extract the topic keywords of EACH SINGLE document, so that each document is represented by its own topic words. But I do not know whether this is reasonable, because in my opinion LDA is usually used to extract the topic words shared by A BUNCH OF documents.

Can LDA be used to detect the topic of A SINGLE document?

Best Answer

Can LDA be used to detect the topic of A SINGLE document?

Yes, in its particular representation of 'topic,' and given a training corpus of (usually related) documents.

LDA represents topics as distributions over words, and documents as distributions over topics. That is, the very purpose of LDA is to arrive at a probabilistic representation of each document as a set of topics. For example, the LDA implementation in gensim can return this representation for any given document.
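As a minimal sketch of what that looks like in gensim (the toy corpus, topic count, and hyperparameters here are my own illustrative choices, not anything from the question):

```python
from gensim import corpora, models

# toy corpus: each document is a list of tokens
texts = [
    ["ball", "game", "score", "team"],
    ["match", "team", "goal", "ball"],
    ["cpu", "memory", "code", "compile"],
    ["code", "bug", "compile", "memory"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# fit a 2-topic model on the whole corpus
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# per-document topic distribution, for a seen or unseen document
new_doc = dictionary.doc2bow(["team", "score", "goal"])
print(lda.get_document_topics(new_doc))  # e.g. [(0, 0.91), (1, 0.09)]
```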

But this depends on the other documents in the corpus: Any given document will have a different representation if analyzed as part of a different corpus.

That's not typically considered a shortcoming: Most applications of LDA focus on related documents. The paper introducing LDA applies it to two corpora, one of Associated Press articles and one of scientific article abstracts. Edwin Chen's nicely approachable blog post applies LDA to a tranche of emails from Sarah Palin's time as Alaska governor.

If your application demands separating documents into known, mutually exclusive classes, then LDA-derived topics can be used as features for classification. Indeed, the original paper does exactly that with the AP corpus, with good results.
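Continuing the sketch above, here is one hedged way to turn those per-document topic distributions into classifier features with scikit-learn (the labels are entirely hypothetical, just to make the example run):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def topic_vector(bow):
    # dense topic-probability vector; minimum_probability=0.0 keeps every topic
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

X = np.vstack([topic_vector(bow) for bow in corpus])
y = np.array([0, 0, 1, 1])  # hypothetical class labels, one per document

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```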

Relatedly, Chen's demonstration doesn't sort documents into exclusive classes, but his documents mostly concentrate their probability mass on single LDA topics. As David Blei explains in this video lecture, the Dirichlet priors can be chosen to favor sparsity. More simply, "a document is penalized for using many topics," as his slides put it. This seems the closest LDA can get to a single, unsupervised topic, but it certainly doesn't guarantee every document will be represented as such.
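In gensim this amounts to setting a small document-topic prior when fitting the model. A sketch, again reusing the toy corpus above (the value 0.01 is an illustrative choice, not a recommendation):

```python
# a small symmetric alpha penalizes documents for spreading probability over
# many topics, pushing each document toward one or a few dominant topics
lda_sparse = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                             alpha=0.01, passes=10)
```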
