Interpreting the result of topic modelling using Latent Dirichlet Allocation

topic-models

I have an unpublished corpus consisting of 135 texts, one per year. I do topic modelling (using MALLET) and then inspect the topic
distribution over time. The overall picture looks good: in the contribution table, which shows the topic composition of the individual texts, a new topic becomes dominant roughly every dozen years, and in a topic-percentage-over-time diagram the topics form nice bell-shaped curves with peak percentages in the range of 50-70%.
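
For concreteness, here is a minimal sketch of how such a diagram can be produced from MALLET's --output-doc-topics file. It assumes the newer one-column-per-topic output format and that each text's file name contains its year; doc_topics.txt is a hypothetical path.

```python
# Sketch: plot per-year topic proportions from a MALLET doc-topics file.
# Assumes MALLET's newer output format: doc id, file name, then one
# proportion column per topic. "doc_topics.txt" is a hypothetical path.
import pandas as pd
import matplotlib.pyplot as plt

num_topics = 10
cols = ["doc", "name"] + [f"topic_{k}" for k in range(num_topics)]
df = pd.read_csv("doc_topics.txt", sep="\t", comment="#",
                 header=None, names=cols)

# Hypothetical assumption: the file name encodes the year, e.g. ".../1885.txt".
df["year"] = df["name"].str.extract(r"(\d{4})", expand=False).astype(int)
df = df.sort_values("year")

for k in range(num_topics):
    plt.plot(df["year"], df[f"topic_{k}"], label=f"topic {k}")
plt.xlabel("year")
plt.ylabel("topic proportion")
plt.legend(fontsize="small")
plt.show()
```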

However, the oldest and the most recent years behave differently: there the respective topics rise to very high values (above 90%, in one case even above 99%). But there is no reason to think that topics should behave differently just because no older or newer data are available.

My question is: how are these artefacts explained, and are there measures to mitigate them?

EDIT: I used the full corpus for training the topics. The topics consist of buzzwords typical of their respective years.

I used a "small" number of topics (10) and every topic is for some time the dominating topic in the corpus (with more than 50% contribution for some years).

Intuitively speaking, Latent Dirichlet Allocation cannot see the topics from the past or the future (outside the corpus) that should contribute some 10-20% to the topic mix at the boundaries.

EDIT2: I used MALLET's default values for the hyperparameters, i.e., alpha=50.0, beta=0.01, gamma=0.01, delta=0.03, delta1=0.2, delta2=1000.0. I also used the defaults for --use-symmetric-alpha (false) and --use-ngrams (false).
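
For reference, this is roughly the training call, as a minimal sketch: file names are hypothetical, the flags are standard mallet train-topics options, and --optimize-interval (off by default) is the switch that would let MALLET re-estimate the hyperparameters instead of keeping the fixed defaults above.

```python
# Sketch: invoke MALLET's trainer with the defaults quoted above.
# Assumes the "mallet" binary is on PATH and "corpus.mallet" was built
# beforehand with "mallet import-dir". File names are hypothetical.
import subprocess

subprocess.run([
    "mallet", "train-topics",
    "--input", "corpus.mallet",
    "--num-topics", "10",
    "--alpha", "50.0",            # Dirichlet prior on document-topic mixes
    "--beta", "0.01",             # Dirichlet prior on topic-word mixes
    # "--optimize-interval", "10",  # uncomment to re-estimate alpha/beta
    "--output-doc-topics", "doc_topics.txt",
    "--output-topic-keys", "topic_keys.txt",
], check=True)
```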

Best Answer

I'm not familiar with MALLET, but that sounds very much like the per-document topic distribution. If terms go in and out of style over the life of the corpus, it is plausible that certain topics would collect terms used only in the earliest and latest documents. What is strange is that these boundary documents contain so few words from the middle topics.

As you allude to in your second edit, the hyperparameters are related to this. One can tune the hyperparameters to bias documents and topics toward a level of sparsity sensible for the task at hand. For instance, on a corpus of breaking-news documents, you may want each document's topic distribution to have only ~2-3 topics of non-negligible contribution. In your case, it seems the boundary documents are too sparse (a single topic absorbing nearly all the mass) to be intuitively sensible for your application.
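
As a hedged illustration of that bias (using gensim rather than MALLET, on a toy corpus I made up), here is how a small versus a large symmetric alpha changes the sparsity of the per-document topic distribution:

```python
# Sketch: effect of the symmetric alpha prior on document-topic sparsity.
# The toy corpus and parameter values are invented for illustration.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["war", "treaty", "empire"],
        ["railway", "steam", "factory"],
        ["radio", "cinema", "jazz"]] * 20   # 60 tiny toy documents

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

for alpha in (0.01, 5.0):   # small alpha -> sparse mixes; large -> smooth
    lda = LdaModel(corpus, id2word=dictionary, num_topics=3,
                   alpha=alpha, random_state=0, passes=10)
    print(f"alpha={alpha}:",
          lda.get_document_topics(corpus[0], minimum_probability=0.0))
```

With alpha=0.01 nearly all of a document's probability mass lands on a single topic; with alpha=5.0 the mass spreads much more evenly across the three topics.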

For more on this relationship, including guidance on which parameters to tune, I'd consult David Blei's lecture here.