Text Mining – Why Is the Same Topic Appearing for All Points in LDA Test Set?

natural-language · text-mining · topic-models

I am working on a dataset that is rather unusual in the following ways:

  1. It doesn't contain just natural-language text; it also contains usernames, code snippets, etc.
  2. An unusually large vocabulary of 2M unique tokens for a set of 750K documents containing about 19M tokens.

All aspects of the dataset are important and have to be included in training, i.e. the usernames, code snippets, etc.

I trained a Latent Dirichlet Allocation (LDA) model after tokenization, stop-word removal, and stemming. The training set is 720K documents, which is about 16M tokens. I trained models with 200 and 300 topics, using 50 and 100 passes over the training data.
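For context, a minimal sketch of such a pipeline using gensim; the documents, preprocessing choices, and parameters here are illustrative, not the actual setup:

    from gensim import corpora
    from gensim.models import LdaModel
    from gensim.parsing.preprocessing import STOPWORDS
    from gensim.parsing.porter import PorterStemmer

    # hypothetical raw documents; the real corpus mixes prose, usernames, and code
    docs = [
        "user42 wrote: int main() { return 0; }",
        "the quick brown fox jumps over the lazy dog",
    ]

    stemmer = PorterStemmer()

    def preprocess(text):
        # tokenize on whitespace, drop standard stop words, stem the rest
        return [stemmer.stem(t) for t in text.lower().split() if t not in STOPWORDS]

    texts = [preprocess(d) for d in docs]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    # topic counts and passes as in the question: 200/300 topics, 50/100 passes
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=200, passes=50)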

On the test set, I looked at the distribution of the 5 most probable topics for each document.

What I found is that the resulting topic-frequency distribution follows Zipf's law for both 200 and 300 topics.

Can someone explain why this is happening? Is it too little training, too much training, or something else?

Attached are the distributions for 200 topics (orange) and 300 topics (blue). (Sorry about the wrong title.) The graphs were plotted by extracting the top-5 topics of each document, counting the occurrences of each topic across the test set (i.e. its topic frequency), and plotting the frequencies in decreasing order.

[Figure: test-set topic-frequency distributions for the 200-topic (orange) and 300-topic (blue) models, sorted in decreasing order]
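A minimal sketch of how such a plot could be produced, assuming the trained `lda` model and `dictionary` from the earlier sketch and a hypothetical list of tokenized test documents `test_texts`:

    from collections import Counter
    import matplotlib.pyplot as plt

    def top5_topics(lda, dictionary, tokens):
        bow = dictionary.doc2bow(tokens)
        dist = lda.get_document_topics(bow, minimum_probability=0.0)
        # keep the ids of the 5 most probable topics for this document
        return [t for t, _ in sorted(dist, key=lambda x: -x[1])[:5]]

    freq = Counter()
    for tokens in test_texts:  # hypothetical tokenized test set
        freq.update(top5_topics(lda, dictionary, tokens))

    counts = sorted(freq.values(), reverse=True)
    plt.plot(range(1, len(counts) + 1), counts)
    plt.xlabel("topic rank")
    plt.ylabel("frequency in test-set top-5 lists")
    plt.show()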

Best Answer

My first bet would be that the function words in a corpus of source code differ vastly from those of standard stop lists, and that your model's first topic is indeed capturing standard programming fare: if, int, new, while, etc.
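A quick way to check this hypothesis with gensim is to print the highest-probability words of the topics that dominate the test set (the topic ids below are illustrative):

    for topic_id in [0, 1, 2]:  # e.g. the most frequently assigned topics
        words = lda.show_topic(topic_id, topn=15)
        print(topic_id, [w for w, _ in words])
    # If these lists are full of tokens like "if", "int", "new", "while",
    # adding them to a custom stop list is a reasonable next step.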

Besides building a custom stop list (seeing which words have high probability under the most frequently assigned topics is a good place to start), you might consider fitting a hierarchical topic model, first described in this paper and in more detail in this one. From the first:

In our approach, each node in the hierarchy is associated with a topic, where a topic is a distribution across words. A document is generated by choosing a path from the root to a leaf, repeatedly sampling topics along that path, and sampling the words from the selected topics. Thus the organization of topics into a hierarchy aims to capture the breadth of usage of topics across the corpus, reflecting underlying syntactic and semantic notions of generality and specificity.

In other words, under this model all documents start at the root node, whose topic captures the most common words in the corpus. (See the paper for examples.) This lets you avoid having to determine a list of stop words manually:

The model has nicely captured the function words without using an auxiliary list, a nuisance that most practical applications of language models require. At the next level, it separated the words pertaining to neuroscience abstracts and machine learning abstracts. Finally, it delineated several important subtopics within the two fields. These results suggest that hLDA can be an effective tool in text applications.
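To make the quoted generative story concrete, here is a toy simulation of hLDA document generation over a single fixed root-to-leaf path; the real model places a nested Chinese restaurant process prior over paths, and every size and parameter here is illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    V = 50      # toy vocabulary size
    depth = 3   # levels on the path: root -> internal -> leaf

    # one topic (a distribution over words) per node on the chosen path
    path_topics = [rng.dirichlet(np.ones(V)) for _ in range(depth)]

    def generate_document(n_words=20, alpha=1.0):
        # per-document mixture over the levels of the path
        theta = rng.dirichlet(alpha * np.ones(depth))
        words = []
        for _ in range(n_words):
            level = rng.choice(depth, p=theta)          # pick a level on the path
            word = rng.choice(V, p=path_topics[level])  # sample a word from its topic
            words.append(word)
        return words

    print(generate_document())

Words drawn from the root-level topic play the role of corpus-wide function words, which is why the root node absorbs them without an auxiliary stop list.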

An implementation is available here. (I'm not aware of one in Python.)