Interpret silhouette plot for large microarray dataset

Tags: bioinformatics, clustering, distance-functions, microarray

For a microarray experiment with ~40,000 probes and ~30 samples I used the clara function from R to cluster my expression matrix. How do I interpret this silhouette plot?

[Figure: my silhouette plot]

Firstly, I don't understand how a k of 3 could have the highest average silhouette, considering the algorithm must be putting together very different genes.

Secondly, many of the clusters for k > 100 have lots of zero and negative scores that are throwing off the average from the otherwise tighter clusters (which are the ones I want anyway). How do I improve my choice of k? Is it OK to divide the average silhouette by k? Or to take the average of only the positive silhouettes?

Best Answer

The silhouette statistic is computed for every object in the set being clustered (what are the objects in your case: probes?). Sole objects (objects left unclustered, i.e. alone in their own cluster) receive a silhouette value of 0 in the solution. This of course pulls down the average silhouette value. You might want to assess the quality of the clustering only among those objects that were actually clustered. To do so, set the silhouette value of sole objects to missing rather than 0 before averaging. This trick implies that sole objects are treated as noise points only, not as clusters in their own right. Please be aware that I'm not an R user and therefore can't comment on the clara function.
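To make the suggestion concrete, here is a minimal sketch in plain Python (not R, and not what clara does internally). It computes per-object silhouette values with Euclidean distance, marks single-member clusters as missing (`None`), and compares the two ways of averaging. The function names `silhouette_values` and `average_silhouette` and the toy data are made up for illustration only.

```python
import math


def silhouette_values(points, labels):
    """Return one silhouette value per point; None for single-member clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean Euclidean distance from point i to the given member indices.
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Sole object: mark as missing instead of the conventional 0.
            values.append(None)
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        values.append(0.0 if denom == 0 else (b - a) / denom)
    return values


def average_silhouette(values, sole_as_zero):
    """Average silhouette, counting sole objects as 0 or skipping them."""
    if sole_as_zero:
        scored = [0.0 if v is None else v for v in values]
    else:
        scored = [v for v in values if v is not None]
    return sum(scored) / len(scored) if scored else float("nan")


# Toy data (hypothetical): two tight pairs plus one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
labels = [0, 0, 1, 1, 2]

values = silhouette_values(points, labels)
avg_with_zero = average_silhouette(values, sole_as_zero=True)
avg_clustered_only = average_silhouette(values, sole_as_zero=False)
print(avg_with_zero, avg_clustered_only)
```

In this toy case the average over clustered objects only is higher than the average that counts the sole object as 0, which is the effect described above. Whether dropping sole objects is appropriate depends on whether you are willing to treat them as noise.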