Solved – Topic modeling (LDA) and n-grams

text mining, topic-models

I am new to text mining; I've been reading up on it and setting up some KNIME workflows and Python tooling. I'd like to analyze customer feedback from surveys and derive topics from it, without going through thousands of open-text responses manually.

However, when reading about topic modelling and LDA, it's all about the individual words; n-grams aren't taken into account. I was thinking about the following approach:

  1. CLEANING: get the responses and get rid of punctuation, stop words, capitalization, etc.
  2. STEMMING: get back to the stems
  3. N-GRAMS: detect the n-grams in the stemmed text
  4. REPLACE: replace the n-gram word combinations x y z in the text with x_y_z
  5. LDA: run LDA on top of this to derive topics

Does this make sense as a workflow, and mathematically?
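
For concreteness, here is a minimal sketch of the workflow I have in mind, assuming gensim and NLTK (the `responses` list, the parameter values, and the stop-word/stemming choices are just placeholders, and the NLTK stopwords corpus has to be downloaded separately):

```python
import re

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from gensim.models.phrases import Phrases, Phraser
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder survey responses; in practice these come from the survey export.
responses = [
    "The new billing portal is confusing and slow.",
    "Customer support resolved my billing issue quickly.",
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))  # requires nltk.download("stopwords")

def preprocess(text):
    # 1. CLEANING: lowercase, strip punctuation, drop stop words
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    # 2. STEMMING: reduce tokens to their stems
    return [stemmer.stem(t) for t in tokens]

docs = [preprocess(r) for r in responses]

# 3./4. N-GRAMS + REPLACE: gensim's Phrases joins frequent pairs "x y" into "x_y"
bigram = Phraser(Phrases(docs, min_count=2, threshold=10))
docs = [bigram[d] for d in docs]

# 5. LDA: build a dictionary/corpus and fit the topic model
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)
print(lda.print_topics())
```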

Best Answer

Not sure if you're still looking for an answer, but I'll chime in anyway.

In short: this "technically" works fine. You're essentially just supplementing your documents to have a big list of all contiguous pairs of words (or n-tuples for general n-grams) at the end.

The topic model will think that new_york has nothing to do with new or york, although you might notice that the individual pieces of a phrase and the phrase itself often show up together in the topics... after all, topic models find groups of words that co-occur in the same documents!
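
A tiny illustration of that point (a sketch, assuming gensim's Dictionary as the bag-of-words vocabulary): once new_york is its own token, it gets a vocabulary id with no built-in relation to the ids of new or york, and any association has to be rediscovered through co-occurrence.

```python
# Sketch: a merged phrase is just another independent vocabulary entry.
from gensim.corpora import Dictionary

docs = [["new", "york", "new_york", "subway"],
        ["new", "product", "launch"]]
dictionary = Dictionary(docs)
# "new", "york", and "new_york" get three unrelated integer ids;
# the topic model sees no structural link between them.
print(dictionary.token2id)
```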

Complications: Now the only real issue with this is that your model may have a harder time fitting to your data if they're supplemented with n-grams. Why? Because you will have a much larger vocabulary to work with, and you'll need to jack up the number of topics to get a comparable fit.

This of course will make your model training much slower.

If you prune your vocabulary, you'll cut out more rare words and keep more phrases, essentially artificially inflating the topic proportions of your documents.
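
To make the pruning point concrete, here is a hedged sketch using gensim's Dictionary.filter_extremes (the documents, thresholds, and token names are made up): frequency-based pruning decides which single words and which merged phrases survive, and that in turn shifts the effective word (and topic) proportions of your documents.

```python
# Sketch: frequency-based vocabulary pruning with n-grams in the mix.
from gensim.corpora import Dictionary

docs = [["servic", "slow", "custom_servic"],
        ["custom_servic", "help", "bill"],
        ["bill", "error", "refund"]]

dictionary = Dictionary(docs)
# Keep tokens that appear in at least 2 documents and in at most 80% of them;
# rare single words and rare merged phrases get dropped together.
dictionary.filter_extremes(no_below=2, no_above=0.8)
print(dictionary.token2id)  # only "custom_servic" and "bill" survive here
```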

You might benefit from judiciously choosing the phrases in your vocabulary instead of just using all n-grams (in fact, probably all text analysis can reap this same benefit). For some ideas, you might find this useful: http://www.mimno.org/articles/phrases/
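
One hedged way to be more selective (an illustration of the general idea, not a prescription from that article): score candidate bigrams with an association measure such as NPMI and only merge pairs above a threshold, which gensim's Phrases supports directly.

```python
# Sketch: merge only strongly collocated pairs, using NPMI scoring.
from gensim.models.phrases import Phrases, Phraser

docs = [["new", "york", "subway"],
        ["new", "york", "pizza"],
        ["new", "product", "launch"]]

# scoring="npmi" bounds scores to [-1, 1]; a higher threshold keeps
# fewer, stronger phrases than raw-count scoring would.
phrases = Phrases(docs, min_count=2, threshold=0.3, scoring="npmi")
bigram = Phraser(phrases)
print([bigram[d] for d in docs])  # "new york" is merged, "new product" is not
```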

Background on topic models that may give the above appropriate context:

LDA simply finds, for each document, a mixture of term distributions that leads to the (approximately) maximal value of the posterior probability of the document-topic proportions and the topic-word proportions (the latter shared across documents). Mathematically this looks like:

$\log p(\beta_{1:K}, \theta_{1:D}, z_{1:D,1:N}, w_{1:D,1:N}) = \sum_{k=1}^{K} \log p(\beta_k) + \sum_{d=1}^{D}\left[\log p(\theta_d) + \sum_{i=1}^{N} \log p(z_{i,d} \mid \theta_d) + \sum_{i=1}^{N} \log p(w_{i,d} \mid z_{i,d}, \beta_{1:K})\right]$

(Note that working in log space lets us separate the multiplicative factors into sums for convenience.)

The important factors in this are the last two terms, because chances are your documents have a reasonable number of words, and so the $ND$ terms there will dominate the function.

So let's look at those two terms more closely:

$\log p(z_{i,d} \mid \theta_d)$ is just the log-probability of that word getting topic assignment $z_{i,d}$ under the topic distribution for that document.

This term will be high if the document has very few topics.

The other dominating term in that big equation is $\log p(w_{i,d} \mid z_{i,d}, \beta_{1:K})$. This term simply asks for the probability of that word under the topic it has been assigned to by $z_{i,d}$. It will be high when that word is very likely in the topic.

These two terms are at odds with each other. One wants few topics per document so all of the topic assignments have high likelihood, while the other wants only a few words per topic so those chosen few words can have high likelihood.
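
A toy numeric illustration of that tension (my own made-up numbers, not part of the derivation above): a peaked document-topic distribution $\theta_d$ makes $\log p(z_{i,d} \mid \theta_d)$ large for the dominant topic, while a peaked topic-word distribution $\beta_k$ makes $\log p(w_{i,d} \mid z_{i,d}, \beta_{1:K})$ large for the few words that topic favors.

```python
# Toy numbers: peaked distributions give higher log-probabilities
# to their favored outcomes than flat ones do.
import numpy as np

theta_peaked = np.array([0.90, 0.05, 0.05])   # document mostly uses topic 0
theta_flat   = np.array([1/3, 1/3, 1/3])      # document spread over topics
print(np.log(theta_peaked[0]))  # log p(z=0 | theta) ≈ -0.105 (high)
print(np.log(theta_flat[0]))    # ≈ -1.099 (lower)

beta_peaked = np.array([0.50, 0.40, 0.05, 0.05])  # topic concentrates on few words
beta_flat   = np.array([0.25, 0.25, 0.25, 0.25])
print(np.log(beta_peaked[0]))  # log p(w | z, beta) ≈ -0.693 (high)
print(np.log(beta_flat[0]))    # ≈ -1.386 (lower)
```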