I am working on a dataset that is rather unusual in the following ways:
- It doesn't contain just natural-language text; it also has things like usernames, code snippets, etc.
- It has an unusually large vocabulary (2M unique tokens) for a set of 750K documents totalling about 19M tokens.
All aspects of the dataset are important and have to be included in training, i.e. the usernames, code snippets, etc.
I trained a Latent Dirichlet Allocation (LDA) model after tokenization, stop-word removal, and stemming. The training set is 720K documents, which is about 16M tokens. I trained with both 200 and 300 topics, and with 50 and 100 passes over the training data.
On the test set, I looked at the distribution of the 5 most probable topics of each document.
What I found is that this distribution follows Zipf's law for both 200 and 300 topics.
Can someone explain why this is happening? Is it too little training, too much training, or something else?
Attached is the distribution for 200 topics (orange) and 300 topics (blue). (Sorry about the wrong title.) The graphs are plotted by extracting the top-5 topics of each document, counting the occurrences of each topic (i.e. the topic frequency in the test set), and plotting the frequencies in decreasing order.
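For reference, the counting step described above can be sketched in pure Python. The `top5_per_doc` input here is hypothetical toy data, standing in for the top-5 topic IDs per test document (which in practice would come from the trained model, e.g. gensim's `get_document_topics`):

```python
from collections import Counter

# Hypothetical input: for each test document, the IDs of its 5 most
# probable topics, in decreasing order of probability.
top5_per_doc = [
    [0, 3, 7, 1, 9],
    [0, 1, 4, 7, 2],
    [3, 0, 7, 5, 1],
]

# Count how often each topic appears among the per-document top-5 lists...
topic_freq = Counter(t for doc in top5_per_doc for t in doc)

# ...and sort the counts in decreasing order, as in the plots.
freqs = sorted(topic_freq.values(), reverse=True)
print(freqs)  # -> [3, 3, 3, 2, 1, 1, 1, 1]
```

A Zipf-like result would show these sorted frequencies decaying roughly as 1/rank.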
Best Answer
My first bet would be that the function words in a corpus of source code differ vastly from those of standard stop lists, and that your model's first topic is indeed capturing standard programming fare: `if`, `int`, `new`, `while`, etc.
Besides building a custom stop list (seeing which words have high probability under the most frequently assigned topics is a good place to start), you might consider fitting a hierarchical topic model, first described in this paper and in more detail in this one.
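That inspection step can be sketched as follows. The topic assignments and per-topic word distributions here are toy stand-ins; with gensim you would get the word-probability pairs from `lda.show_topic(t)` instead:

```python
from collections import Counter

# Toy stand-in data (hypothetical): one topic assignment per document,
# and each topic's word distribution (word -> probability).
topic_assignments = [0, 0, 0, 1, 0, 2, 0, 1]
topic_words = {
    0: {"if": 0.30, "int": 0.25, "while": 0.20, "parser": 0.01},
    1: {"socket": 0.20, "thread": 0.15, "new": 0.10},
    2: {"matrix": 0.25, "eigen": 0.20},
}

# Take the most frequently assigned topics...
most_common_topics = [t for t, _ in Counter(topic_assignments).most_common(2)]

# ...and collect their high-probability words as stop-word candidates
# for manual review (the 0.10 threshold is arbitrary).
stopword_candidates = {
    word
    for t in most_common_topics
    for word, prob in topic_words[t].items()
    if prob >= 0.10
}
print(sorted(stopword_candidates))
```

On this toy data the candidates are the programming function words (`if`, `int`, `new`, ...), which is exactly the kind of list you would then fold back into preprocessing.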
Using this model, all documents start at the root node, whose topic will include the most common words in the corpus. (See the paper for examples.) This lets you avoid determining a list of stop words manually.
An implementation is available here. (I'm unaware of one in Python.)