Clustering – How to Validate Clustering Results with Labeled Data

clustering, validation

I am working on a clustering algorithm and would like to validate its performance against a well-known and widely used dataset: the KDD-CUP 99 dataset (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). With this dataset, both unlabeled and labeled test data are provided. My question is, how should I validate my clustering algorithm's performance?

Let's say the results of my algorithm are as follows:
x1 -> cluster A
x2 -> cluster A
x3 -> cluster B
x4 -> cluster A

And let's say the labels provided are as follows:
x1 -> cluster 1
x2 -> cluster 1
x3 -> cluster 1
x4 -> cluster 2

Given that the cluster labels are completely different, how should I compare these? In this case, an obvious assumption would be that cluster A corresponds to cluster 1, but the correspondence may not always be so obvious. Is there any standardized way to evaluate such situations?
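The "cluster A is probably cluster 1" intuition can be made explicit by searching for the label correspondence that maximizes agreement. A minimal sketch using the Hungarian algorithm from SciPy (`scipy.optimize.linear_sum_assignment`); the cluster assignments are the ones from the example above, with A/B and 1/2 encoded as 0/1:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

pred = np.array([0, 0, 1, 0])  # algorithm output: A=0, B=1 for x1..x4
true = np.array([0, 0, 0, 1])  # provided labels: cluster 1=0, cluster 2=1

# Build the contingency (confusion) matrix: rows = predicted clusters,
# columns = true clusters, entry [i, j] = number of shared points.
k = max(pred.max(), true.max()) + 1
C = np.zeros((k, k), dtype=int)
for p, t in zip(pred, true):
    C[p, t] += 1

# Negate so the minimizing assignment maximizes the matched counts.
rows, cols = linear_sum_assignment(-C)
mapping = dict(zip(rows, cols))          # predicted cluster -> matched true label
accuracy = C[rows, cols].sum() / len(pred)
print(mapping, accuracy)
```

For this toy example any matching covers at most 2 of the 4 points (accuracy 0.5), which illustrates why a forced one-to-one relabeling can be brittle; the permutation-invariant indices discussed in the answer below avoid choosing a matching at all.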

Best Answer

Look into distances between clusterings. These measures are all computed from the confusion (contingency) matrix between the two clusterings, so they do not depend on how the clusters happen to be named. Well known are the Rand index and the adjusted Rand index, but I generally recommend either Variation of Information or the lesser-known split-join distance (see e.g. Comparing clusterings: Rand Index vs Variation of Information and How to interpret these indices/metrics for comparing partitions intuitively out of these images? for more discussion).
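As an illustration of the label-invariance, here is a short sketch using scikit-learn's `adjusted_rand_score` on the example from the question (A/B encoded as 0/1, cluster 1/2 as 0/1; the function name is from scikit-learn's real API, the encoding is mine):

```python
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

pred = [0, 0, 1, 0]  # algorithm output: x1,x2,x4 -> A, x3 -> B
true = [0, 0, 0, 1]  # provided labels:  x1,x2,x3 -> 1, x4 -> 2

# The confusion matrix between the two clusterings, from which the
# Rand-type indices are computed.
print(contingency_matrix(true, pred))

# Adjusted Rand index: 1.0 for identical partitions, ~0 for random
# labelings (can go negative for worse-than-chance agreement).
ari = adjusted_rand_score(true, pred)

# Renaming the predicted clusters does not change the score.
assert ari == adjusted_rand_score(true, [1 - p for p in pred])
```

Here the ARI is negative (about -0.33), i.e. the two partitions agree less than two random labelings would on average, which matches the intuition that only x1 and x2 are grouped consistently.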