Solved – K-means clustering in Matlab for feature selection

k-means, MATLAB

I am doing feature selection on a multidimensional cancer data set (27803 × 84).
I want to try the k-means clustering algorithm in Matlab, but how do I decide how many clusters I want? Is it equal to the number of classes I have (in my case 2: cancer or no cancer)? Please suggest. Thank you.

Best Answer

This is a rather well-known problem with k-means clustering - there's not a great way to choose k a priori. See http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set

Generally, you end up running k-means for a range of values of k and then choosing the best k based on some goodness metric; several are suggested in the linked article.
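To make the "try several k and score each one" loop concrete, here is a minimal sketch. It assumes the mean silhouette width as the goodness metric (one of several options in the linked article) and uses plain Python on a made-up toy data set of three 2-D blobs rather than Matlab and your cancer data. The helper names `kmeans`, `best_of` and `silhouette` are my own, not library functions.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


def best_of(points, k, n_init=10):
    """Restart k-means n_init times; keep the lowest within-cluster sum of squares."""
    best = None
    for r in range(n_init):
        labels, centroids = kmeans(points, k, seed=r)
        sse = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
        if best is None or sse < best[0]:
            best = (sse, labels)
    return best[1]


def silhouette(points, labels):
    """Mean silhouette width over all points (higher means better separated)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; treat as uninformative
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items() if m != l
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy data (an assumption for illustration): three well-separated 2-D blobs.
data_rng = random.Random(42)
blob_centers = [(0, 0), (10, 0), (0, 10)]
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in blob_centers
    for _ in range(30)
]

# Score every candidate k and keep the one with the highest silhouette.
scores = {k: silhouette(points, best_of(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The restarts matter: a single k-means run can settle in a poor local optimum, which would distort the comparison between values of k. In Matlab itself, `kmeans` with the 'Replicates' option and `evalclusters` from the Statistics and Machine Learning Toolbox cover the same ground, if you have that toolbox.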

You could also try running a principal component analysis (PCA) to reduce the dimensionality of your space. If you find that only a few principal components provide most of the discrimination between cancer and no-cancer, you may be able to plot the data in the reduced space and decide visually how many clusters there are.
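As a rough illustration of the PCA idea, the sketch below projects a small synthetic two-class data set onto its first two principal components and reports the fraction of variance each one explains. Everything here is an assumption for illustration: the data are made up (four features, with the classes differing only in the first), the helper names are mine, and the eigenvectors come from simple power iteration with deflation so that the example needs no libraries. For real data you would use a proper PCA routine.

```python
import math
import random


def center(rows):
    """Subtract the column means from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, n_iter=500, seed=0):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    d = len(cov)
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, lam


def pca(rows, n_components=2):
    """Return (scores, explained): projections and variance fractions."""
    x = center(rows)
    cov = covariance(x)
    d = len(cov)
    total_var = sum(cov[i][i] for i in range(d))
    components, explained = [], []
    for _ in range(n_components):
        v, lam = top_component(cov)
        components.append(v)
        explained.append(lam / total_var)
        # Deflation: remove the component just found before searching again.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    scores = [[sum(a * b for a, b in zip(row, v)) for v in components] for row in x]
    return scores, explained


# Synthetic data (hypothetical): label 1 is shifted by +3, label 0 by -3,
# along the first feature only; the other three features are pure noise.
rng = random.Random(0)
labels = [i % 2 for i in range(100)]
rows = [
    [rng.gauss(3.0 if y else -3.0, 1.0)] + [rng.gauss(0, 1) for _ in range(3)]
    for y in labels
]

scores, explained = pca(rows, n_components=2)
print(explained)  # fraction of total variance captured by PC1 and PC2
```

Plotting `scores` as a 2-D scatter, colored by class label, is the "visual check" described above. Keep in mind that PCA ranks directions by variance, not by class separation, so a high-variance component is not guaranteed to be the discriminative one. In Matlab, the `pca` function from the Statistics and Machine Learning Toolbox returns the scores and the explained variance directly.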

If you have access to Science, there's an interesting algorithm discussed here: http://www.sciencemag.org/content/344/6191/1492.full. It doesn't assume roughly spherical clusters the way k-means does, and it also provides some metrics for choosing the number of clusters in a principled fashion.