Solved – Can a labeled LDA (Latent Dirichlet Allocation) dataset have just one label per document

Tags: classification, machine learning, natural language, topic-models

I understand that in labeled LDA, every document should be associated with a set of labels, which act as the observed (tagged) topics for that document.

My question is whether a document can be tagged with just one label, and whether it still makes sense to train a labeled LDA on a corpus of documents where each document is tagged with only one topic/label from a fixed set of labels.

Furthermore, can such a system/model be used as a multiclass classifier, so that given an unlabeled document, the model can assign one of the labels to the test document?

Best Answer

There's nothing stopping you, but this essentially reduces to learning a bag-of-words model for each label, albeit with a shared prior in the form of $\eta$. The new model would look like this:

[Figure: graphical model of the reduced labelled LDA]

To see why these are equivalent, see this snippet from the labelled LDA paper:

The traditional LDA model then draws a multinomial mixture distribution $\theta^{(d)}$ over all $K$ topics, for each document $d$, from a Dirichlet prior $\alpha$. However, we would like to restrict $\theta^{(d)}$ to be defined only over the topics that correspond to its labels $\Lambda(d)$. Since the word-topic assignments $z_i$ (see step 9 in Table 1) are drawn from this distribution, this restriction ensures that all the topic assignments are limited to the document’s labels.

If the document has only a single—and importantly, observed—label, its topic assignment is limited to the corresponding topic, and all its words are generated from the same multinomial distribution. (This is because $\Lambda(d)$ will ensure that only one value of $\theta^{(d)}$ is nonzero.)
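To make that reduction concrete, here is a minimal sketch in Python of what training amounts to in this degenerate case: count words per label and smooth the counts with the shared prior $\eta$. The toy corpus, the variable names, and the value of `eta` are all made up for illustration, not taken from the labelled LDA paper.

```python
import numpy as np
from collections import Counter

# Toy corpus: each document has exactly one observed label.
docs = [("the cat sat on the mat", "pets"),
        ("dogs chase the ball", "pets"),
        ("stocks fell sharply today", "finance"),
        ("the market rallied on earnings", "finance")]

vocab = sorted({w for text, _ in docs for w in text.split()})
labels = sorted({lab for _, lab in docs})
w_idx = {w: i for i, w in enumerate(vocab)}

eta = 0.1  # shared symmetric Dirichlet prior on the word distributions

# Count word occurrences per label. With a single observed label per document,
# every word-topic assignment is forced to that label's topic.
counts = np.zeros((len(labels), len(vocab)))
for text, lab in docs:
    for w, c in Counter(text.split()).items():
        counts[labels.index(lab), w_idx[w]] += c

# Posterior-mean estimate of each topic's multinomial under the shared prior:
# this is just a smoothed per-label bag-of-words model.
phi = (counts + eta) / (counts + eta).sum(axis=1, keepdims=True)
```

Each row of `phi` is exactly the per-label bag-of-words model described above.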

It bears superficial resemblance to a mixture of unigrams, where each document is produced by a single topic. But in that model the topic is a latent variable, and in your case it's observed. Cf. the mixture of unigrams model as described in the original LDA paper:

Under this mixture model, each document is generated by first choosing a topic $z$ and then generating $N$ words independently from the conditional multinomial $p(w|z)$. [...] When estimated from a corpus, the word distributions can be viewed as representations of topics under the assumption that each document exhibits exactly one topic.

There's nothing wrong with a bag-of-words model, but it's worth noting that the paper introducing LDA showed LDA achieving better (lower) held-out perplexity than the mixture of unigrams in two experiments (Figure 9).

To the question about classification, sure: each bag-of-words model gives you a likelihood for the document, and you can combine this with a prior on topics to find $p(z|\textbf{w})$ using Bayes' rule. (If your prior on topics is uniform, this is equivalent to maximum likelihood.)
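Continuing the sketch above (and reusing `phi`, `labels`, and `w_idx` from it), one hedged way to do that classification is shown below; the `classify` helper and its uniform-prior default are illustrative assumptions on my part, not part of labelled LDA itself.

```python
import numpy as np

def classify(text, phi, labels, w_idx, prior=None):
    """Assign a label via Bayes' rule: p(z|w) is proportional to p(w|z) p(z),
    where p(w|z) is the label's bag-of-words likelihood of the document."""
    if prior is None:  # uniform prior over labels -> equivalent to maximum likelihood
        prior = np.full(len(labels), 1.0 / len(labels))
    log_post = np.log(prior)
    for w in text.split():
        if w in w_idx:  # ignore out-of-vocabulary words
            log_post += np.log(phi[:, w_idx[w]])
    return labels[int(np.argmax(log_post))]

print(classify("the cat chased the ball", phi, labels, w_idx))
```

Working in log space simply avoids underflow when documents get long; normalizing `log_post` would give you the full posterior $p(z|\textbf{w})$ rather than just the argmax.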

You've only asked whether this is possible, and not about likely performance. But for what it's worth, my intuition is that you'll get better predictive performance with regular LDA and a subsequent classifier. When in doubt, cross validate.