I am working on a genomics project and have ended up with a large table of around 800 measurements (cases/rows), around 200 channels (columns/continuous variables) and one categorical column with 5 levels.
I would like to do two things:
- Find sub-groups within each level of the categorical variable that I already have
- Create a new classification of these 800 measurements based only on the information
I have been doing my homework and have read about strategies such as k-means and PCA, but I have also found that it is very useful to get rid of redundant variables first. How can I choose these properly?
Someone recommended that I use multinomial regression. Is there a good resource you would recommend for getting started?
I am using R.
Many thanks
Best Answer
There is a fairly new technique that Tibshirani and his student developed called "sparse clustering". I think it is meant exactly for this situation, where there are many predictors but we would like to find a small subset of them that really matter. It is available in R as the "sparcl" package, which implements sparse versions of k-means and hierarchical clustering.
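As a rough illustration of how sparse k-means in "sparcl" might be applied to a table of this shape, here is a minimal sketch. The data are simulated (800 rows, 200 channels, with only the first 10 channels carrying group structure), and the number of clusters K = 2, the grid of candidate bounds and the small number of permutations are assumptions chosen for illustration, not recommendations for the real data.

```r
# install.packages("sparcl")  # if the package is not installed yet
library(sparcl)

set.seed(1)

# Simulated stand-in for the real table: 800 measurements x 200 channels.
# Assumption for illustration: only channels 1-10 separate the two groups.
n <- 800
p <- 200
groups <- rep(1:2, each = n / 2)
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
x[groups == 2, 1:10] <- x[groups == 2, 1:10] + 2
x <- scale(x)  # put all channels on a comparable scale

# Pick the L1 bound on the channel weights with the package's permutation
# approach. The grid and nperms are kept small here to keep the run short.
perm <- KMeansSparseCluster.permute(
  x, K = 2,
  wbounds = seq(1.5, 8, length.out = 8),
  nperms = 5, silent = TRUE
)

# Fit sparse k-means at the selected bound.
fit <- KMeansSparseCluster(x, K = 2, wbounds = perm$bestw, silent = TRUE)

weights  <- fit[[1]]$ws  # one non-negative weight per channel
clusters <- fit[[1]]$Cs  # cluster label for each measurement

# Channels with a non-zero weight are the ones the clustering actually uses.
selected <- which(weights > 0)
print(selected)
print(table(clusters))
```

To look for sub-groups within one level of the categorical variable, the same calls could be run on only the rows belonging to that level. The package also provides HierarchicalSparseCluster for the hierarchical variant.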