Solved – Labeling documents with short text labels after topic modeling

natural-language · text-mining · topic-models · unsupervised-learning

If I generate a topic model (LDA, PLSA) for a group of documents, is there a way to label each document with a one- or two-word label that describes its content?

For example, if I were modeling local business listings/reviews on Yelp, is there a reliable way to generate labels such as "Coffee," "Clothes," etc.?

I know this is a lot to ask, but it is a problem that I am interested in and I figured that I'm not the only one. It seems like I might be able to use the highest probability words for a given topic, but each document will probably belong to multiple topics. I might also want to use another process to limit my labels to specific types of words (e.g. nouns).

Best Answer

In general, there's no reason to assume that the distributions over words—topics, in model parlance—will give highest probability to the most natural "label" for the topic.

You can see this in the sample topics and corresponding top words shown in the paper introducing LDA:

"Arts"  "Budgets"   "Chidren"   "Education"
NEW     MILLION     CHILDREN    SCHOOL
FILM    TAX         WOMEN       STUDENTS
SHOW    PROGRAM     PEOPLE      SCHOOLS
MUSIC   BUDGET      CHILD       EDUCATION
MOVIE   BILLION     YEARS       TEACHERS
PLAY    FEDERAL     FAMILIES    HIGH
MUSICAL YEAR        WORK        PUBLIC
BEST    SPENDING    PARENTS     TEACHER
ACTOR   NEW         SAYS        BENNETT
FIRST   STATE       FAMILY      MANIGAT
YORK    PLAN        WELFARE     NAMPHY
OPERA   MONEY       MEN         STATE
THEATER PROGRAMS    PERCENT     PRESIDENT
ACTRESS GOVERNMENT  CARE        ELEMENTARY
LOVE    CONGRESS    LIFE        HAITI

This, I'd argue, shows that choosing the top word as a label is problematic. "Children" is both the top term and the topic label, but "arts", a pretty natural label for the first topic, doesn't appear among its top words at all. "Education" is in the top five, and "budget" is too if we're flexible about plurals. Taking the top word of each topic, then, makes no sense in the first two cases and only reasonable sense in the latter two.
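To make the heuristic concrete, here's a minimal sketch of extracting the top words per fitted topic, assuming scikit-learn's LatentDirichletAllocation and a toy corpus of my own invention (nothing here comes from the LDA paper):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for real documents (e.g. Yelp reviews).
docs = [
    "great coffee and friendly baristas, best espresso in town",
    "the latte art was lovely, cozy cafe with good pastries",
    "stylish clothes and helpful staff, found a great jacket",
    "nice boutique, the dresses and shirts are well priced",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    # components_[k] is an (unnormalized) distribution over the vocabulary;
    # argsort in descending order gives the topic's most probable words.
    top = terms[topic.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top)}")
```

The top word is simply whichever term received the most probability mass; nothing forces it to be a good category name.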

Of course, these labels were manually, subjectively chosen by the authors, and you might have labelled them differently. I myself would have used "family" instead of "children" for the third. Moreover, the topics themselves would change if you altered $k$, $\alpha$, or $\eta$.

You can choose $k$ to minimize held-out perplexity, but will the resulting topics be meaningful to typical readers? And what threshold should a document's mixing proportion for a topic reach before you label the document with that topic?
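There's no principled universal cutoff; it's a judgment call. A minimal sketch of the thresholding step, again with a toy corpus and an arbitrary 0.25 cutoff of my own choosing:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great coffee and friendly baristas, best espresso in town",
    "stylish clothes and helpful staff, found a great jacket",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

theta = lda.transform(X)  # rows are per-document topic mixing proportions
threshold = 0.25          # arbitrary cutoff, purely illustrative

for d, props in enumerate(theta):
    labels = np.where(props >= threshold)[0]
    print(f"doc {d}: topics {labels.tolist()} (proportions {props.round(2)})")

# Held-out perplexity is one way to pick k, but a low-perplexity model
# is not guaranteed to yield human-interpretable topics.
print("perplexity:", lda.perplexity(X))
```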

One extension, supervised LDA (sLDA), introduced by Blei and McAuliffe in "Supervised topic models", lets you fit the distributions over words and a per-document response jointly. That is, if you have or can develop a labelled corpus, you can build models that predict whether a given label applies to a new document. This would let you dodge the problem of finding meaning in fitted topics: you would start with what's meaningful and fit topics to it.
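sLDA itself isn't in scikit-learn (dedicated implementations exist, e.g. tomotopy's SLDAModel). As a rough two-stage stand-in, purely to sketch the workflow with a toy labelled corpus of my own invention (true sLDA fits topics and response jointly rather than in sequence), you could feed LDA topic proportions into an ordinary classifier:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled corpus: texts paired with the labels we care about.
docs = [
    "great coffee and friendly baristas, best espresso in town",
    "the latte art was lovely, cozy cafe with good pastries",
    "stylish clothes and helpful staff, found a great jacket",
    "nice boutique, the dresses and shirts are well priced",
]
labels = ["Coffee", "Coffee", "Clothes", "Clothes"]

# Two-stage pipeline: fit topics first, then a classifier on the
# per-document topic proportions. sLDA would fit both jointly.
clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
clf.fit(docs, labels)

print(clf.predict(["cheap shirts and a decent pair of jeans"]))
```

Either way the point stands: with labels as the starting point, the topics only have to be predictive, not self-explanatory.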