Solved – When to use LDA over GMM for clustering

clustering, gaussian-mixture-distribution, topic-models, unsupervised-learning

I have a dataset of user activity with 168 dimensions, from which I want to extract clusters using unsupervised learning. It is not obvious to me whether to use a topic-modelling approach such as Latent Dirichlet allocation (LDA) or a Gaussian mixture model (GMM), which is more of a Bayesian approach. In that regard I have two related questions:

  1. What is the main differentiator between the two methods? I know the basics of both models, but I am curious about what really sets one apart from the other. Is there something in the problem or data that can tell me whether one model is a better fit?

  2. If I apply both methods to my data, how can I compare the results to see which method is better?

Update

The 168 user activity variables are counts of an activity, so they take non-negative discrete values. There is no maximum value, but roughly 90% of the variables take values in the interval $[0,3]$.

It might make sense simply to model all of these activity variables as binary variables indicating whether each count is zero or non-zero, but we do not yet know enough about the problem to determine that. The main thing we are looking for is insight into the different clusters of user activity.
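
Concretely, the binarization we are considering would be something like the following (in Python; `X` stands in for our users × activities count matrix, with made-up numbers for illustration):

```python
import numpy as np

# Stand-in for the real data: rows = users, columns = the 168 activity counts
X = np.random.default_rng(0).poisson(lam=0.8, size=(1000, 168))

# 1 if the user performed the activity at all, 0 otherwise
X_bin = (X > 0).astype(int)
```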

Best Answer

I would not use Gaussian mixture models, as they require the constituent distributions all to be normal. You have counts, which are discrete and non-negative, so a GMM is inappropriate by definition.
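
To make that assumption explicit, a GMM models the density as a weighted sum of multivariate normal components,

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \sum_{k=1}^{K} \pi_k = 1,$$

and each normal component puts density on all of $\mathbb{R}^{168}$, whereas your data live on the non-negative integers.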

Latent Dirichlet allocation (full disclosure: I don't really know topic modeling) requires your data to be multinomial, but you can have counts in that case: they would be counts of occurrences of different categories of a single variable. Another possibility is that your counts are counts of different variables, as in having several Poisson variables. This is a bit of an ontological question of how you are thinking about your data.
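
The two views imply different likelihoods for the same vector of counts $(x_1, \ldots, x_K)$:

$$\text{multinomial (total fixed at } n\text{):} \quad P(x_1, \ldots, x_K) = \frac{n!}{x_1! \cdots x_K!} \, p_1^{x_1} \cdots p_K^{x_K}, \qquad \sum_k x_k = n,$$

$$\text{independent Poissons:} \quad P(x_1, \ldots, x_K) = \prod_{k=1}^{K} \frac{\lambda_k^{x_k} e^{-\lambda_k}}{x_k!}.$$

The multinomial conditions on the total count; the Poisson view lets the total vary freely.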

Consider a simple example where I go to the grocery store because I want some fruit. I will purchase a certain number of apples, oranges, peaches and bananas. Each of those could be considered a separate Poisson variable. When I get home I put all of them in a fruit bowl. Later, when I feel like snacking, I might reach into the bowl without looking and grab two pieces of fruit (e.g., an apple and a peach). That can be considered a draw from a multinomial distribution. In both cases, I have counts of categories, but we think of them differently. In the first case, the fruits I will buy are known before I get to the grocery store, but the number purchased in each category can vary. In the second case, I don't know which fruits I will pick but I know I'm grabbing two from the possible types.
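
A small simulation makes the distinction concrete (all rates and probabilities below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
fruits = ["apple", "orange", "peach", "banana"]

# Grocery store: each fruit's count is an independent Poisson draw.
# The categories are fixed in advance; the total purchased varies freely.
rates = [2.0, 1.5, 1.0, 3.0]             # made-up expected purchases per fruit
purchases = rng.poisson(rates)           # one basket: counts per fruit

# Fruit bowl: I grab exactly n pieces; only the split across fruits is random.
n = 2
probs = np.array([0.4, 0.2, 0.1, 0.3])   # made-up composition of the bowl
snack = rng.multinomial(n, probs)        # counts per fruit, summing to n
```

In the first process nothing constrains `purchases.sum()`; in the second, `snack.sum()` is always `n`.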

If your data are like the fruit bowl example, LDA may be appropriate for you. On the other hand, if they are like the grocery store example, you could try Poisson finite mixture modeling. That is, you can use mixture modeling with distributions other than the Gaussian/normal. GMMs are by far the most common; other distributions (such as the Poisson) are more exotic, and I don't know how widely implemented they are in software. If you use R, a quick search turned up ?PoisMixClus in the HTSCluster package and the rebmix package (note that I've never used either, or done Poisson mixture modeling). It may be possible to find implementations for other software as well.
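
I don't know of a stock Poisson mixture in the mainstream Python libraries either, but the model is simple enough that a minimal EM sketch shows what such packages are doing. This is my own illustration, not the HTSCluster or rebmix implementation; each cluster $k$ has a mixing weight $\pi_k$ and a vector of Poisson rates $\lambda_k$:

```python
import numpy as np
from scipy.special import logsumexp

def poisson_mixture_em(X, K, n_iter=100, seed=0):
    """Fit a K-component mixture of independent Poissons to counts X (n, d) by EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                  # mixing weights
    # Initialize each component's rates by perturbing the per-variable means
    lam = X.mean(axis=0) * rng.uniform(0.5, 1.5, size=(K, d)) + 1e-6

    for _ in range(n_iter):
        # E-step: log responsibilities; the log(x!) term is constant in k, so dropped
        log_p = X @ np.log(lam).T - lam.sum(axis=1) + np.log(pi)   # (n, K)
        r = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))

        # M-step: weighted MLEs of the mixing weights and Poisson rates
        nk = r.sum(axis=0) + 1e-12               # effective size of each cluster
        pi = nk / n
        lam = (r.T @ X) / nk[:, None] + 1e-10    # small constant keeps rates positive

    return pi, lam, r
```

Hard cluster assignments would then be `r.argmax(axis=1)`, and the number of clusters $K$ could be chosen by BIC or held-out likelihood.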


Adding some specifics: first, I would say LDA is at least as much a Bayesian technique as GMM, so I would not choose between them on that basis.

  1. I suspect the most important differentiator between LDA and GMM is the type of data they assume you have.
  2. You cannot compare them, because they are for different kinds of data. (Nor would I really want to compare LDA and Poisson MM, as they conceptualize the counts differently.)

I would not dichotomize your data into zero / non-zero.