R – Suggestions for Multi-Dimensional Clustering

clustering, dimensionality reduction, large data, model-based-clustering, r

I am working on a genomics project and have ended up with a large table: around 800 measurements (cases/rows), around 200 channels (columns, all continuous variables), and one categorical column with 5 categories.

I would like to do two things:

  • Find sub-groups within each level of the categorical variable that I already have
  • Create a new classification of these 800 measurements based only on the information

I have been doing my homework and have read about strategies such as k-means and PCA, and I have found that it is very useful to get rid of redundant variables first. How can I choose which variables to drop?

Someone recommended multinomial regression to me. Can you recommend a good resource for getting started with it?

I am using R.
Many thanks

Best Answer

There is a fairly new technique, developed by Tibshirani and his student Daniela Witten, called "sparse clustering". I think it is meant for exactly this situation, where there are many variables but we would like to find a small subset of them that really matters for the grouping. It is available in R as the "sparcl" package, which implements sparse versions of k-means and hierarchical clustering.
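
To give a feel for how this might look in practice, here is a minimal sketch of sparse k-means with "sparcl". It assumes the package is installed from CRAN, and it runs on a small synthetic matrix that only stands in for the real 800 x 200 table. The number of clusters (K = 3), the grid of tuning values and the number of permutations are illustrative assumptions, not recommendations.

```r
# install.packages("sparcl")  # assumption: package available from CRAN
library(sparcl)

set.seed(1)

# Synthetic stand-in for the real data: 120 cases x 50 channels.
# By construction, only the first 5 channels separate 3 hidden groups.
n <- 120
p <- 50
groups <- rep(1:3, each = n / 3)
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
x[, 1:5] <- x[, 1:5] + 3 * groups   # shift the informative channels by group
x <- scale(x)                       # put all channels on a common scale

# Pick the L1 bound on the feature weights with a permutation approach.
# The bounds must be greater than 1; this grid is an arbitrary choice.
perm <- KMeansSparseCluster.permute(
  x, K = 3,
  wbounds = seq(1.5, 6, length.out = 8),
  nperms = 5, silent = TRUE
)

# Fit sparse k-means at the selected bound.
fit <- KMeansSparseCluster(x, K = 3, wbounds = perm$bestw, silent = TRUE)

weights  <- fit[[1]]$ws   # one non-negative weight per channel
clusters <- fit[[1]]$Cs   # cluster label for each case

# Channels with the largest weights drive the clustering;
# channels with weight zero are effectively dropped.
head(order(weights, decreasing = TRUE), 10)
sum(weights > 0)

# On synthetic data we can compare against the known groups.
table(clusters, groups)
```

On the real table, x would be the numeric matrix of the 200 channels. To look for sub-groups within each category, one option is to run the same steps separately on the rows belonging to each level of the categorical variable. The package also provides HierarchicalSparseCluster for the hierarchical variant.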