Clustering – How to Select and Validate a Clustering Method?

clustering, hierarchical-clustering, model-evaluation, validation

One of the biggest issues with cluster analysis is that we may arrive at different conclusions depending on which clustering method we use (including different linkage methods in hierarchical clustering).

I would like to know your opinion on this: which method would you select, and how? One might say "the best clustering method is the one that gives you the right answer"; but I could question in response that cluster analysis is supposed to be an unsupervised technique, so how do I know which method or linkage gives the right answer?

In general: is a single clustering robust enough to rely on? Or do we need a second method and a shared result to base ourselves on both?

My question is not only about possible ways to validate/evaluate clustering performance, but is broader: on what basis do we select/prefer one clustering method/algorithm over another? Also, are there common warnings we should look out for when selecting a method to cluster our data?

I know that this is a very general question and very difficult to answer. I would only like to know whether you have any comments, advice, or suggestions for me to learn more about this.

Best Answer

It is often said that there is no other analytical technique as strongly of the "as you sow, so shall you reap" kind as cluster analysis is.

I can imagine a number of dimensions or aspects of the "rightness" of this or that clustering method:

  1. Cluster metaphor. "I prefer this method because it constitutes clusters in a way that meets my concept of a cluster in my particular project". Each clustering algorithm or subalgorithm/method implies its corresponding structure/build/shape of a cluster. In regard to hierarchical methods, I've observed this in one of the points here, and also here. I.e. some methods give clusters that are prototypically "types", others give "circles [by interest]", still others "[political] platforms", "classes", "chains", etc. Select the method whose cluster metaphor suits you. For example, if I see my customer segments as types, i.e. more or less spherical shapes with compaction(s) in the middle, I'll choose Ward's linkage method or K-means, but clearly never single linkage. If I need a focal representative point, I could use the medoid method. If I need to screen points for being core or peripheral representatives, I could use the DBSCAN approach.

  2. Data/method assumptions. "I prefer this method because the nature or format of my data predisposes to it". This important and vast point is also mentioned in my link above. Different algorithms/methods may require different kinds of data, or different proximity measures to be applied to the data; and vice versa, different data may call for different methods. There are methods for quantitative data and methods for qualitative data. A mixture of quantitative and qualitative features dramatically narrows the scope of choice among methods. Ward's or K-means are based, explicitly or implicitly, on (squared) Euclidean distance as the proximity measure only, and not on an arbitrary measure. Binary data may call for special similarity measures which, in turn, will strongly call into question the use of some methods for them, for example Ward's or K-means. Big data may need special algorithms or special implementations.

  3. Internal validity. "I prefer this method because it gave me the most clear-cut, tight-and-isolated clusters". Choose the algorithm/method that shows the best results for your data from this point of view. The tighter and denser the clusters are inside, and the lower the density outside them (or the wider apart the clusters are), the greater the internal validity. Select and use appropriate internal clustering criteria (of which there are plenty: Calinski-Harabasz, Silhouette, etc.; sometimes also called "stopping rules") to assess it. [Beware of overfitting: all clustering methods seek to maximize some version of internal validity$^1$ (it's what clustering is about), so high validity may be partly due to random peculiarities of the given dataset; having a test dataset is always beneficial.]

  4. External validity. "I prefer this method because it gave me clusters which differ on their background characteristics, or clusters which match the true ones I know". If a clustering partition presents clusters which are clearly different on some important background characteristics (i.e. ones not participating in the cluster analysis), then it is an asset for the method which produced the partition. Use any analysis which applies to check the difference; there also exist a number of useful external clustering criteria (Rand, F-measure, etc.). Another variant of the external validation case is when you somehow know the true clusters in your data (know the "ground truth"), such as when you generated the clusters yourself. Then how accurately your clustering method is able to uncover the real clusters is the measure of external validity.

  5. Cross-validity. "I prefer this method because it gives me very similar clusters on equivalent samples of the data, or extrapolates well onto such samples". There are various approaches and their hybrids, some more feasible with certain clustering methods than with others. The two main approaches are the stability check and the generalizability check. Checking the stability of a clustering method, one randomly splits or resamples the data into partly intersecting or fully disjoint sets and does the clustering on each; then one matches and compares the solutions with respect to some emergent cluster characteristic (for example, a cluster's central tendency location) to see whether it is stable across the sets. Checking generalizability implies doing the clustering on a training set and then using its emergent cluster characteristic or rule to assign objects of a test set, in addition to clustering the test set itself. The cluster memberships of the test-set objects obtained from the assignment and from the clustering are then compared.

  6. Interpretation. "I prefer this method because it gave me clusters which, when explained, are most persuasive that there is meaning in the world". It's not statistical; it is your psychological validation. How meaningful are the results for you, the domain, and, possibly, the audience/client? Choose the method giving the most interpretable, spicy results.

  7. Gregariousness. Some researchers regularly, and all researchers occasionally, would say: "I prefer this method because, with my data, it gave results similar to those of a number of other methods among all those I probed". This is a heuristic but questionable strategy which assumes that there exist quite universal data or a quite universal method.
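
To make point 3 concrete, here is a minimal sketch of comparing several methods on the same data with internal criteria. The dataset, the methods compared, and every parameter below are my own illustrative assumptions, not a recipe:

```python
# Compare a few clustering methods on one dataset using two internal
# validity criteria (higher is better for both).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Toy data with a known "type"-like (spherical) structure.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

methods = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "ward":    AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "single":  AgglomerativeClustering(n_clusters=3, linkage="single"),
}

for name, model in methods.items():
    labels = model.fit_predict(X)
    print(f"{name:8s} silhouette={silhouette_score(X, labels):.3f} "
          f"CH={calinski_harabasz_score(X, labels):.1f}")
```

On data that really is spherical-blob shaped, K-means and Ward will typically score well; but remember the overfitting caveat in point 3 — the criterion itself favors certain cluster shapes.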

Points 1 and 2 are theoretical and precede obtaining the result; exclusive reliance on these points is the haughty, self-assured exploratory strategy. Points 3, 4 and 5 are empirical and follow the result; exclusive reliance on these points is the fidgety, try-it-all-out exploratory strategy. Point 6 is creative, which means that it denies any result in order to try to rejustify it. Point 7 is loyal mauvaise foi.
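
The stability check from point 5 can also be sketched briefly: cluster two random subsamples, then compare the two solutions on the points they share, using the adjusted Rand index (which is invariant to label permutation). All specifics here (data, method, k, subsample size) are assumptions for illustration:

```python
# Stability check: repeatedly cluster two overlapping subsamples and
# compare cluster memberships of the shared points via adjusted Rand.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=1)
rng = np.random.default_rng(1)

scores = []
for _ in range(20):
    a = rng.choice(len(X), size=300, replace=False)
    b = rng.choice(len(X), size=300, replace=False)
    la = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[a])
    lb = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[b])
    # Positions of each original index within each subsample.
    pos_a = {idx: i for i, idx in enumerate(a)}
    pos_b = {idx: i for i, idx in enumerate(b)}
    shared = np.intersect1d(a, b)
    scores.append(adjusted_rand_score(
        [la[pos_a[i]] for i in shared],
        [lb[pos_b[i]] for i in shared]))

print(f"mean ARI across resamples: {np.mean(scores):.3f}")
```

A mean ARI near 1 suggests the method recovers essentially the same partition regardless of which sample it sees; values drifting toward 0 are a warning sign.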

Points 3 through 7 can also serve as judges in your selection of the "best" number of clusters.
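
As one example of that use, an internal criterion (point 3) can be swept over candidate numbers of clusters. The data, the criterion chosen, and the range of k below are illustrative assumptions:

```python
# Choose the number of clusters k by maximizing the silhouette score
# over a range of candidate k values.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=2)

best_k, best_s = None, -1.0
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_score(X, labels)
    print(f"k={k}: silhouette={s:.3f}")
    if s > best_s:
        best_k, best_s = k, s

print("chosen k:", best_k)
```

The same caveat applies as before: the criterion is not neutral with respect to the method or the cluster shape (see the footnote), so it is a judge, not an oracle.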


$^1$ A concrete internal clustering criterion is itself not "orthogonal to" a clustering method (nor to the kind of data). This raises the philosophical question of the extent to which such a biased or prejudiced criterion can be of utility (see the answers just noticing it).