Solved – LDA topic modelling improvement

machine-learning, natural-language, topic-models

I am working on an LDA model to identify the topics of ~100,000 online courses based on their course descriptions and titles. Later in the process, I would like to use these topics to cluster the courses. So far, the topics identified by our model have not been great, and I am looking for ways to improve them – and to get some thoughts on my attempts at improvement. Here is a quick summary of our current – pretty standard – approach, as well as some ideas I have for improvement:

1. Merging title, subtitle, and course description

2. Removing descriptions under 100 words, as well as non-English descriptions

For training, I am using only the longer English descriptions. Of course, this means that courses with non-English descriptions will be classified randomly.

3. Picking a random 30,000 descriptions

The number is somewhat arbitrary. I have noticed that the topics are "clearer" when using fewer descriptions for training. However, we don't want our topics to be biased by the particular descriptions that happen to be chosen in this step.

4. Removing stopwords

Both self-defined ones and those from a library.

5. Removing punctuation

6. Lemmatizing words

7. Removing words that appear in over 50% of documents
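The sampling and cleaning steps above can be sketched with nothing but the standard library; the toy documents, stopword list, and thresholds below are illustrative stand-ins for the real corpus (lemmatization is omitted here, since it typically requires a library such as NLTK or spaCy):

```python
import random
import string
from collections import Counter

# Toy corpus standing in for the ~100,000 merged title + description strings;
# the stopword list and thresholds are illustrative, not the post's actual values.
docs = [
    "An Introduction to the Python programming language for beginners",
    "Advanced Python: data structures, algorithms, and the standard library",
    "Watercolour painting for beginners: materials, techniques, and practice",
]

STOPWORDS = {"an", "to", "the", "for", "and"}
SAMPLE_SIZE = 2          # stands in for the 30,000 used in the post
MAX_DOC_FREQ = 0.5       # drop words appearing in over 50% of documents

random.seed(42)          # fixed seed so the random sampling step is reproducible
sample = random.sample(docs, SAMPLE_SIZE)

def tokenize(text):
    # lowercase, strip punctuation, drop stopwords
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOPWORDS]

tokenized = [tokenize(d) for d in sample]

# document frequency: in how many documents does each word appear?
doc_freq = Counter(w for doc in tokenized for w in set(doc))
cutoff = MAX_DOC_FREQ * len(tokenized)
filtered = [[w for w in doc if doc_freq[w] <= cutoff] for doc in tokenized]
```

Fixing the seed also makes it possible to check how sensitive the resulting topics are to the sampling step: rerun with a few different seeds and compare.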

To identify recurring topics, I ran the model multiple times in a loop and printed the resulting topics. Based on the topic overlap across those iterations, I am considering adding Wikipedia articles related to the recurring topics to the descriptions we use for training. This way, I am hoping to "strengthen" those topics in the training data and make them clearer – in the hope of getting more interpretable topics. Currently, I am adding around 150 Wikipedia articles to a corpus of 30,000 course descriptions, and the results seem promising.
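The eyeballing of topic overlap across iterations can be made more quantitative. A minimal sketch, assuming each run yields a list of top words per topic (in practice these would come from the trained model, e.g. gensim's `show_topics`; the word lists here are hypothetical), is to match topics across runs by Jaccard overlap of their top words:

```python
def jaccard(a, b):
    # Jaccard similarity of two word lists: |intersection| / |union|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical top words per topic from two training runs with different seeds
run_1 = [["python", "code", "programming"], ["paint", "colour", "brush"]]
run_2 = [["paint", "brush", "canvas"], ["python", "programming", "software"]]

# For each topic in run 1, find its best match in run 2; consistently high
# scores across many runs suggest the topic genuinely recurs.
matches = []
for topic in run_1:
    best = max(run_2, key=lambda other: jaccard(topic, other))
    matches.append((topic, best, jaccard(topic, best)))

for topic, best, score in matches:
    print(topic, "->", best, f"overlap={score:.2f}")
```

Averaging the best-match scores over many run pairs gives a single stability number per topic, which is a more defensible basis for deciding which topics to reinforce than visual inspection.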

My main question is: is the approach of adding pre-selected Wikipedia articles to our training data valid? What are the implications of this?

I am aware that by using this approach, I'm "pushing" the model in the direction of topics that we saw in initial runs – however, I believe that training on this data set will lead to a better/more interpretable classification of course descriptions.

What are your thoughts?

Best Answer

Using Wikipedia articles for data augmentation sounds quite reasonable for your task. I have worked on online course data (although not for LDA) and used a similar approach. However, only 150 Wikipedia articles does not sound like enough to me. Wikipedia's API is well established, so obtaining a much larger Wikipedia corpus is possible with little effort.

In our research, we represent courses using category labels from relevant Wikipedia articles, where "relevant" is defined as being in the top 100 by cosine similarity between a Wikipedia article and a course description. For our task of learning a concept graph, we found this approach to perform best compared to other representation schemes. If you are interested in the details, please look here.
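That top-k cosine matching can be sketched in a few lines; the raw term counts and article titles below are toy stand-ins for whatever term weighting (e.g. TF-IDF) and corpus you actually use:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # cosine similarity between two sparse term-count vectors
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy stand-ins: one course description and two Wikipedia articles
course = Counter("learn python programming from scratch".split())
articles = {
    "Python (programming language)":
        Counter("python programming language interpreter".split()),
    "Watercolor painting":
        Counter("painting watercolor pigment paper".split()),
}

k = 1  # top-100 in practice; 1 here for the toy example
ranked = sorted(articles, key=lambda t: cosine(course, articles[t]), reverse=True)
top_k = ranked[:k]
print(top_k)
```

With proper TF-IDF weighting the same ranking loop selects, for each course, the articles whose category labels (or titles) you then attach.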

For topic modelling, you might actually benefit even more from Wikipedia articles, because their titles often serve either as pretty good topic labels or as anchor words.