Solved – Clustering based on large Jensen-Shannon Divergence distance matrix

Tags: clara, clustering, distance, k-medoids, k-means

I have a dataset with a large number of features and about 15,000 observations. I'm using a probability-distribution distance metric related to the Jensen-Shannon divergence (JSD), computed as described at http://enterotype.embl.de/enterotypes.html, to cluster the observations. I'm applying the R implementation of Partitioning Around Medoids (PAM) clustering to my JSD distance matrix.

The issue is that the distance matrix seems to be too big and eats all the memory, so I'm looking for alternative implementations. k-means doesn't work with distance metrics other than Euclidean, and R's clara works only with Euclidean and Manhattan distances. Spark does not support PAM or CLARA yet.
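For context, here is a minimal sketch of the workflow that runs out of memory. The JSD helper follows the formula described on the enterotype page (not copied verbatim), and `docs` is a hypothetical placeholder for an observations-by-features matrix of probability distributions:

    library(cluster)

    # Jensen-Shannon distance between two probability vectors; a small
    # pseudocount avoids log(0) when a probability is zero.
    jsd <- function(p, q, pseudo = 1e-6) {
      p <- (p + pseudo) / sum(p + pseudo)
      q <- (q + pseudo) / sum(q + pseudo)
      m <- (p + q) / 2
      kld <- function(x, y) sum(x * log(x / y))
      sqrt(0.5 * kld(p, m) + 0.5 * kld(q, m))
    }

    # Pairwise distances; `docs` is hypothetical (rows are observations,
    # each row a probability distribution over the features).
    # Quadratic double loop: shown for clarity, not speed.
    n <- nrow(docs)
    d <- matrix(0, n, n, dimnames = list(rownames(docs), rownames(docs)))
    for (i in seq_len(n - 1)) {
      for (j in (i + 1):n) {
        d[i, j] <- d[j, i] <- jsd(docs[i, ], docs[j, ])
      }
    }

    # A dense 15,000 x 15,000 double matrix is already ~1.8 GB
    # (15000^2 * 8 bytes), and pam() makes additional working copies.
    fit <- pam(as.dist(d), k = 10, diss = TRUE)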

UPDATE 5.2.2015:
Thanks, Aleksandr, for the thorough collection of links! I would actually like to stay within the scope of my question. I'm interested in the similarities of my documents. The JSD distances express document similarity very well and I'm happy with them; I would like to continue from there. For each document, I need to figure out the most similar ones that are relevant from a business point of view.

There are two overly naive approaches: for each document, order the other documents by JSD distance and select either the first n, or all those below some JSD threshold, as the similar ones.
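A minimal sketch of both naive selections, assuming `d` is the symmetric JSD distance matrix with document ids as row and column names (the ids here are hypothetical):

    # n nearest neighbours, or all neighbours below a JSD cutoff
    most_similar <- function(d, doc, n = 10, threshold = NULL) {
      dists <- sort(d[doc, ])
      dists <- dists[names(dists) != doc]  # drop the document itself (distance 0)
      if (is.null(threshold)) head(dists, n) else dists[dists < threshold]
    }

    most_similar(d, "doc42", n = 5)            # top-n variant
    most_similar(d, "doc42", threshold = 0.3)  # threshold variant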

The better approach is to let a clustering algorithm form the clusters, because both of those naive approaches either cut off too many similar documents or mark as similar documents that actually are not.

So my question is: which clustering algorithm is the most reasonable to use, given that I already have a very good JSD distance matrix to feed in?

Best Answer

Depending on specifics, consider the following alternatives. I am sure that you're familiar with some, but maybe not all, of these methods. Additionally, some of the papers referenced below describe algorithm modifications that might be appropriate for your specific task and data sets.

  • K-means adaptations. For example, see this paper and this paper. Also, see this paper on using bootstrapping in K-means cluster analysis (while the paper focuses on speed, the space improvement is IMHO implied as well, due to the nature of the bootstrapping process).
  • Model-based clustering: mixture modeling. This is an interesting option, implemented in several R packages, most notably mclust (http://www.stat.washington.edu/mclust). The approach, methods and software are well presented in this vignette paper.
  • Model-based clustering: Dirichlet processes (DP). Another popular option is Bayesian Dirichlet mixture models and hierarchical DP. If I understand the material correctly, (probabilistic) topic modeling also fits this category and includes such approaches as latent Dirichlet allocation (LDA) (note: not to be confused with a different method with the same abbreviation, linear discriminant analysis, mostly used for dimensionality reduction, as I understand). More information on LDA: this introductory paper, some other relevant publications and a very recent paper on a much improved LDA approach. A sketch using the topicmodels package follows after this list.
  • Hierarchical clustering (HC). In addition to traditional HC, you may find some interesting hybrid approaches, such as Dirichlet diffusion trees, which apply the HC approach to DP mixtures (see this paper; other partially related research can be found via links on this page). A minimal sketch on your precomputed distance matrix follows after this list.
  • Latent variable modeling (LVM)-based clustering. Clustering applications include latent class analysis (LCA)-based latent class clustering (LCC) (see this paper) and latent tree models, described here. For some discussion, including a comparison between LCA and K-means, see this page and this paper.
  • Information theory-based clustering. For example, see this paper.
  • Neural networks- and genetic algorithms-based clustering. For example, see this paper.
  • If I remember correctly, I've also seen some papers on using entropy for classification, but I can't find them at the moment (I will update this answer if that changes).
  • Some other interesting/relevant papers: a comparison of LCC and PAM; clustering with Bregman divergences (which probably belongs in the information theory-based clustering category).
  • Some relevant discussions on Cross Validated: here and here.
  • For determining an optimal number of clusters, see this excellent answer on StackOverflow; a silhouette-based sketch follows after this list.
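Since you already have a precomputed JSD distance matrix, a few of the options above can be sketched directly against it. Hierarchical clustering accepts an arbitrary precomputed dissimilarity; a minimal sketch, assuming `d` is the JSD matrix from the question:

    # Agglomerative clustering straight from the JSD dissimilarity;
    # still O(n^2) memory for the matrix, but no extra copies like pam().
    hc <- hclust(as.dist(d), method = "average")
    clusters <- cutree(hc, k = 10)  # cut the tree into a chosen number of groups

For choosing the number of clusters from the same matrix, average silhouette width is one common criterion (my choice here, not necessarily what the linked answer recommends):

    library(cluster)
    ks  <- 2:20
    sil <- sapply(ks, function(k) {
      cl <- cutree(hc, k = k)
      mean(silhouette(cl, dist = as.dist(d))[, "sil_width"])
    })
    best_k <- ks[which.max(sil)]  # k with the highest average silhouette

And since the observations are documents, LDA (the topic-modeling bullet above) can sidestep the distance matrix entirely by working on raw term counts; a sketch using the topicmodels package, where `dtm` is a hypothetical document-term matrix of counts:

    library(topicmodels)
    lda_fit <- LDA(dtm, k = 10, control = list(seed = 1))
    head(topics(lda_fit))  # most likely topic per document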