Silhouette Clustering – Practical Application of Silhouette Clustering Index

clustering, k-means, large data

I don't have much experience with data analysis algorithms (data mining, machine learning, if you like), and I'd be interested if someone could share their experience with the practical use of Silhouette in cluster validation/interpretation.

Concretely, is it practical to use it on big data? My question comes from the fact that you need n^2 distance calculations to get the output, which could be a problem with larger data sets and an expensive distance measure.

For example, I cluster signals with an average length of 1500 points per signal, and currently I have about 10,000 signals (and there could be many more). That means that if I want to produce a silhouette I need to calculate 10^8 distances, which is a lot.

Also, I need to calculate distances between all pairs of points, whereas performing K-means (for example) requires far fewer calculations. So a question arises: should I spend that much more computation on the validation than on the clustering itself?

Is silhouette really practically used?

Best Answer

"I want to produce silhouette I need to calculate 10^8 distances"

So this question is rather about (resources for) computing a big square distance matrix than about the Silhouette criterion. Yes, you need all the distances to be able to compute silhouette values: this criterion is matrix-based.

"Is silhouette really practically used?"

Yes, sure. Cluster analysis - for your information - is used not only for big data.

But if you are specifically doing K-means clustering on big data, then silhouette is not the best choice. K-means is all about centroids and variance, and therefore criteria such as Calinski-Harabasz are better suited here.
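To illustrate why a centroid-based criterion is cheaper, here is a minimal sketch of the Calinski-Harabasz index in plain Python. It assumes the data are fixed-length numeric vectors with squared Euclidean distance (so it would not apply directly to variable-length signals), and the function name `calinski_harabasz` is only illustrative. Each point is compared with its own centroid only, so no pairwise distance matrix is needed.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: (between / (k - 1)) / (within / (n - k)).

    points : list of equal-length numeric vectors (assumption of this sketch)
    labels : cluster label for each point
    """
    n = len(points)
    dim = len(points[0])

    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("need more than one cluster and fewer clusters than points")

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sqdist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    overall = mean(points)
    between = 0.0  # size-weighted squared distances of centroids to the grand mean
    within = 0.0   # squared distances of points to their own centroid
    for members in clusters.values():
        centroid = mean(members)
        between += len(members) * sqdist(centroid, overall)
        within += sum(sqdist(p, centroid) for p in members)

    if within == 0:
        return float("inf")  # all points coincide with their centroids
    return (between / (k - 1)) / (within / (n - k))


# Toy usage: two well-separated groups on a line.
score = calinski_harabasz([[0.0], [1.0], [10.0], [11.0]], [0, 0, 1, 1])
```

The cost grows roughly linearly with the number of points (times the number of dimensions), which is in the same range as a K-means iteration.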

"I'm using K-medoids variant, so I'm not sure if that would work? Also, I'm using a specific distance measure (Dynamic Time Warping)"

The standard (original, by Kaufman & Rousseeuw) Silhouette statistic is based on the average distance between a point of interest and the points of a cluster (its own cluster or an alien cluster). Average distance is perhaps the most "universal" or "neutral" (assumption-free) measure of cluster closeness or within-cluster density. If so, then you may use Silhouette to assess partitions produced by any clustering method (provided that you can, i.e. your data is not too big).
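As a minimal sketch of that average-distance version: for each point, a is its mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette value is (b - a) / max(a, b). The code below assumes a precomputed full n x n distance matrix (from any measure, DTW included); the function name `silhouette_values` and the convention of giving singleton clusters a value of 0 are choices made for this illustration.

```python
def silhouette_values(dist, labels):
    """Per-point silhouette values from a full pairwise distance matrix.

    dist   : n x n symmetric list of lists of distances (any measure)
    labels : cluster label for each of the n points
    """
    # Map each cluster label to the indices of its members.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for i in range(len(labels)):
        own = clusters[labels[i]]
        if len(own) == 1:
            values.append(0.0)  # convention used here for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy usage: four points on a line, distance = absolute difference.
xs = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(u - v) for v in xs] for u in xs]
vals = silhouette_values(dist, [0, 0, 1, 1])
overall = sum(vals) / len(vals)  # mean silhouette of the partition
```

Every point is compared with every other point, which is where the quadratic cost comes from.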

On the other hand, the Silhouette formula itself is general, and so the criterion could be made to reflect notions other than average distance. For example, my SPSS macro (found on my web-page) computing silhouette also implements nearest-neighbour, farthest-neighbour, and distance-to-centroid notions. Thus, the Silhouette criterion can exist in a number of versions; one version is better suited to one clustering method, another version to another.

It is possible to program a special distance-to-medoid, and ultimately distance-to-whatever-you-fancy, version of the silhouette criterion. All one needs is to supply all the distances to those "quasi-centres" of clusters. And by the way, if you have all the distances from each point to each "centre", you won't need the matrix of pairwise distances between all the points at all, and bigger data could be happily processed.
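A minimal sketch of such a distance-to-medoid version, under the assumption that you already have an n x k table of distances from every point to every cluster medoid (for K-medoids these are typically available from the clustering step). The function name `medoid_silhouette_values` is illustrative only. Here a is the distance to the point's own medoid and b is the distance to the nearest other medoid.

```python
def medoid_silhouette_values(dist_to_medoids, labels):
    """Silhouette-style values using only point-to-medoid distances.

    dist_to_medoids : n x k list of lists; entry [i][c] is the distance
                      from point i to the medoid of cluster c
    labels          : for each point, the column index of its own cluster
    """
    values = []
    for row, own in zip(dist_to_medoids, labels):
        if len(row) < 2:
            raise ValueError("need at least two clusters")
        a = row[own]                                        # distance to own medoid
        b = min(d for c, d in enumerate(row) if c != own)   # nearest other medoid
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy usage: points 0, 1, 10, 11 on a line with medoids at 0 and 10.
d2m = [[0.0, 10.0], [1.0, 9.0], [10.0, 0.0], [9.0, 1.0]]
vals_medoid = medoid_silhouette_values(d2m, [0, 0, 1, 1])
```

This needs n x k distances instead of n x n, so with 10,000 signals and, say, 10 clusters it is 10^5 distance evaluations rather than 10^8. Note that a medoid itself always gets the value 1 in this version, since its distance to its own medoid is zero.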

The Silhouette formula itself, and the most common (average-distance) version of the index, are not tied to a specific distance measure. However, other, specially devised versions of it could turn out to be tied to one.
