Solved – Looking for sparse and high-dimensional clustering implementation

algorithms, clustering, large data

I'm looking for a clustering implementation with the following features:

  • Support for high-dimensional data. I currently have approximately 160,000 dimensions/features.
  • Ability to handle sparse matrices: not only reading them, but also performing operations directly in sparse format.
  • Clear output of the centroid for each cluster.
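To make the three requirements concrete, here is a minimal sketch of cosine-based (spherical) k-means that never builds a dense matrix. It is only an illustration of what "operating in sparse format" and "showing the centroid" could look like, not a recommendation of a specific package. Everything in it is an assumption made for the example: rows are stored as dicts mapping feature index to value, the seeding simply takes evenly spaced rows, and the names `spherical_kmeans` and `sparse_centroid` are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def normalize(vec):
    """Scale a sparse vector (dict: feature index -> value) to unit length."""
    norm = sqrt(sum(v * v for v in vec.values()))
    return {j: v / norm for j, v in vec.items()} if norm else {}

def dot(a, b):
    """Sparse dot product; iterates over the shorter vector only."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(j, 0.0) for j, v in a.items())

def sparse_centroid(rows):
    """Mean of sparse rows, storing only features that occur in the cluster."""
    total = defaultdict(float)
    for row in rows:
        for j, v in row.items():
            total[j] += v
    return {j: v / len(rows) for j, v in total.items()}

def spherical_kmeans(rows, k, n_iter=20):
    """Cosine-similarity k-means on sparse rows (assumed seeding: evenly spaced rows)."""
    rows = [normalize(r) for r in rows]
    centroids = [dict(rows[i * len(rows) // k]) for i in range(k)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assign each row to the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: dot(r, centroids[c])) for r in rows]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = normalize(sparse_centroid(members))
    return labels, centroids

# Toy data: 4 rows in a 160,000-dimensional space with 2-3 non-zeros each.
rows = [
    {0: 1.0, 5: 2.0},
    {0: 2.0, 5: 1.0, 9: 0.5},
    {159999: 3.0, 120000: 1.0},
    {159999: 1.0, 120000: 2.0},
]
labels, centroids = spherical_kmeans(rows, k=2)
for c, centroid in enumerate(centroids):
    print(c, sorted(centroid.items()))  # only the non-zero centroid entries
```

Memory use here scales with the number of non-zero entries rather than with the 160,000 columns, and each centroid can be printed directly as a list of (feature, weight) pairs.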

I've tested some packages:

  • RapidMiner, which seems to consume a great deal of memory. I suspect that although it can read a sparse matrix, it cannot operate on the data in sparse form.
  • Cluto, which is very fast and uses little memory, but does not clearly report the centroid elements (and the source code is not available). It lists descriptive features, each with a percentage indicating how much that feature contributes to the average similarity, but there is no clear information about how that percentage is calculated (here is a question about that, with no clear answer). I also have clusters where a feature shows 0.0%, and it is not clear to me whether this means the program cannot display more decimal places or that the feature contributes nothing to the average similarity.
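As a way to reason about the percentages, the sketch below shows one plausible reading of "contribution to the average similarity". It is an assumption, not Cluto's documented calculation. It assumes rows are scaled to unit length and compared by cosine similarity; under that assumption the average pairwise similarity in a cluster (self-pairs included) equals the squared length of the mean vector, so feature j's share is its squared centroid entry divided by that total. The function name `feature_contributions` is hypothetical.

```python
from collections import defaultdict
from math import sqrt

def unit(vec):
    """Scale a non-empty sparse vector (dict: feature -> value) to unit length."""
    norm = sqrt(sum(v * v for v in vec.values()))
    return {j: v / norm for j, v in vec.items()}

def feature_contributions(cluster_rows):
    """Share of the average pairwise cosine similarity attributable to each feature.

    Assumed interpretation only: with unit-length rows, the average of all
    pairwise dot products equals sum_j centroid[j]**2.
    """
    rows = [unit(r) for r in cluster_rows]
    centroid = defaultdict(float)
    for r in rows:
        for j, v in r.items():
            centroid[j] += v / len(rows)
    total = sum(v * v for v in centroid.values())
    return {j: v * v / total for j, v in centroid.items()}

# A cluster in which feature 1 is present but carries very little weight.
cluster = [{0: 1.0, 1: 0.01}, {0: 1.0}]
shares = feature_contributions(cluster)
for j, share in sorted(shares.items()):
    print(f"feature {j}: {100 * share:.1f}%")
```

Under this reading, a feature with a small but non-zero share is displayed as 0.0% once the value is rounded to one decimal place, so a printed 0.0% would not by itself prove that the feature contributes nothing.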

I would appreciate any comments or answers.

Best Answer

I recommend looking at the answer that JCWong gave in this question about a method called 'sparse clustering', developed by Robert Tibshirani and Daniela Witten. This method selects only the features that actually determine the differences between groups in the data. It is available as an R package called 'sparcl'.

The article is:

Witten DM, Tibshirani R (2010). A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726.
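To give a feel for how the feature selection works, here is a rough sketch of the feature-weight update used in sparse k-means, as I understand the paper. Given the per-feature between-cluster sum of squares, the weights w maximize the weighted sum subject to ||w||_2 <= 1, ||w||_1 <= s and w_j >= 0, which leads to a soft-thresholded, L2-normalized vector. This is not the 'sparcl' API: the names `update_weights` and `soft_threshold`, the bisection search, and the toy numbers are all assumptions made for illustration.

```python
from math import sqrt

def soft_threshold(values, delta):
    """Shrink each value towards zero by delta, clipping at zero."""
    return [max(x - delta, 0.0) for x in values]

def l2_normalize(values):
    """Scale to unit L2 norm (assumes at least one strictly positive entry)."""
    norm = sqrt(sum(x * x for x in values))
    return [x / norm for x in values]

def update_weights(bcss, s, n_steps=60):
    """Feature weights for fixed cluster assignments.

    bcss: non-negative between-cluster sum of squares, one entry per feature.
    s:    L1 bound on the weights; smaller s gives sparser weights (s >= 1).
    """
    w = l2_normalize(bcss)
    if sum(w) <= s:          # L1 constraint already satisfied, no shrinkage
        return w
    lo, hi = 0.0, max(bcss)  # bisection on the threshold delta
    best = w
    for _ in range(n_steps):
        mid = (lo + hi) / 2
        w = l2_normalize(soft_threshold(bcss, mid))
        if sum(w) > s:
            lo = mid         # still too dense: shrink harder
        else:
            hi = mid
            best = w         # feasible: remember it, then try less shrinkage
    return best

# Toy example: two informative features and two nearly uninformative ones.
bcss = [10.0, 8.0, 0.5, 0.1]
weights = update_weights(bcss, s=1.2)
print(weights)
```

In the full procedure this step alternates with an ordinary k-means step on the reweighted features. Features whose weight ends up exactly zero are effectively dropped, which is what makes the approach attractive when there are far more features than informative ones.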