I do not perfectly understand what you mean by internal and external quality. I assume that "internal" refers to a measure computed on the obtained partition itself, while "external" compares it to the result you would like to obtain.
Usually, an internal measure compares the within-cluster distances to the distances between clusters. Intuitively, if clusters are dense and well separated, then you have a good clustering. Since that is the objective of clustering, you cannot really do better, unless you ask people to look at your partitions and judge whether or not they are good.
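As a minimal sketch of such an internal measure (not any one named index, just the within-vs-between idea stated above), you can divide the average within-cluster distance by the average between-cluster distance; the lower the ratio, the denser and better separated the clusters:

```python
from itertools import combinations
from math import dist

def internal_quality(clusters):
    """Mean within-cluster distance divided by mean between-cluster
    distance. Lower is better. Assumes every cluster has >= 2 points."""
    within = [dist(a, b)
              for pts in clusters
              for a, b in combinations(pts, 2)]
    between = [dist(a, b)
               for c1, c2 in combinations(clusters, 2)
               for a in c1 for b in c2]
    return (sum(within) / len(within)) / (sum(between) / len(between))

# Two dense, well-separated clusters score low ...
tight = internal_quality([[(0, 0), (0, 1)], [(10, 10), (10, 11)]])
# ... while a partition that splits the natural groups scores high.
mixed = internal_quality([[(0, 0), (10, 10)], [(0, 1), (10, 11)]])
```

Here `tight` comes out well below 1 and `mixed` well above 1, which matches the intuition that the first partition is good and the second is not.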
If the resulting clustering does not seem good to you, it is probably because either your points are not correctly placed or your distance is not adapted to your problem. For example, suppose that your expected clusters form long parallel rectangles in your representation. If you use a Euclidean distance, you won't be able to find the expected partition.
To solve this problem, if the resulting partition puts points in the same cluster that should not belong together, ask yourself why the chosen distance considered them as close. Then, just build (or read about) a new distance function that avoids this problem.
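To make the rectangle example concrete, here is a hedged sketch (the weighted metric and its `x_weight` value are illustrative assumptions, not a recipe): with long horizontal clusters, plain Euclidean distance calls a point in the neighbouring band "closer" than a far-away point in the same band, while a distance that down-weights the elongated axis does not.

```python
from math import hypot

def euclidean(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])

def stretched(p, q, x_weight=0.1):
    # Down-weight the x axis so points in the same long horizontal
    # band count as close. The weight is a hypothetical value you
    # would tune for your own data.
    return hypot(x_weight * (p[0] - q[0]), p[1] - q[1])

a, b = (0.0, 0.0), (10.0, 0.0)   # far apart in the SAME elongated cluster
c = (0.0, 1.5)                   # a point in the neighbouring band

wrong = euclidean(a, b) > euclidean(a, c)   # Euclidean groups a with c
fixed = stretched(a, b) < stretched(a, c)   # weighted metric keeps a with b
```

Both `wrong` and `fixed` are true here: the same three points get grouped differently purely because the distance function changed.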
To sum up, if you find that the computed partition does not make sense, it is not necessarily because your quality measure is wrong; more likely, the clustering performed the wrong task. Finding a good distance and space representation is probably the main task when doing clustering.
The problem, in particular with k-means applied to real-world, labeled data, is that clusters will usually not agree with your labels very well, unless you either generated the labels using a similar clustering algorithm (a self-fulfilling prophecy) or the data set is really simple.
Have you tried computing the k-means statistics, such as sums of squares, on the original labeling of the data set? I would not be at all surprised if they are substantially worse than after running k-means.
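To try this yourself, you only need the within-cluster sum of squares of an arbitrary labeling; a minimal sketch (plain Python, not any particular library's implementation):

```python
def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster mean,
    for any labeling -- original labels or a k-means assignment."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    total = 0.0
    for pts in groups.values():
        mean = [sum(coord) / len(pts) for coord in zip(*pts)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mean))
                     for p in pts)
    return total
```

For example, on four points forming two obvious pairs, the k-means-style labeling `[0, 0, 1, 1]` gives a far smaller WCSS than the crossed labeling `[0, 1, 0, 1]`, which is exactly the effect described above: original labels can score much worse on k-means' own criterion.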
I figure it's just another case of the algorithm not fitting your problem.
Evaluating clustering algorithms is really hard, because you actually want to find something you do not know yet. Even if the clustering reproduced the original labels, it would actually have failed, because it did not tell you anything new; you could just have used the labels instead.
Maybe the most realistic evaluation of a clustering algorithm is the following: if you incorporate the result of the clustering algorithm into a classification algorithm, does it improve the classification accuracy significantly? I.e. treat clustering as preprocessing/support functionality for an algorithm that you can evaluate reasonably.
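A toy sketch of this evaluation (the XOR-style blobs, the "recovered" centroids, and the majority-vote classifier are all illustrative assumptions): the class is not a monotone function of either raw coordinate, so a baseline classifier does no better than chance, but feeding the cluster id into the classifier makes the problem trivial.

```python
from collections import Counter
from math import dist

# Toy data: four blobs, class depends on which blob a point is in
# (XOR pattern), not on either raw coordinate alone.
blobs = {(0, 0): 0, (0, 5): 1, (5, 0): 1, (5, 5): 0}
points, labels = [], []
for centre, cls in blobs.items():
    for dx, dy in [(-0.2, 0), (0.2, 0), (0, -0.2), (0, 0.2)]:
        points.append((centre[0] + dx, centre[1] + dy))
        labels.append(cls)

# Pretend a clustering step recovered these centroids.
centroids = list(blobs)

def cluster_id(p):
    return min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))

# Baseline classifier: predict the global majority class.
majority = Counter(labels).most_common(1)[0][0]
base_acc = sum(lab == majority for lab in labels) / len(labels)

# Same classifier, but using the cluster id as a feature:
# a majority vote within each cluster.
per_cluster = {}
for p, lab in zip(points, labels):
    per_cluster.setdefault(cluster_id(p), []).append(lab)
vote = {k: Counter(v).most_common(1)[0][0] for k, v in per_cluster.items()}
clust_acc = sum(vote[cluster_id(p)] == lab
                for p, lab in zip(points, labels)) / len(labels)
```

Here `base_acc` is 0.5 (chance) and `clust_acc` is 1.0: the clustering result demonstrably improved the downstream classifier, which is the evaluation signal being proposed.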
Best Answer
The choice of metric rather depends on what you consider the purpose of clustering to be. Personally, I think clustering ought to be about identifying different groups of observations that were each generated by a different data-generating process. So I would test the quality of a clustering by generating data from known data-generating processes and then calculating how often patterns are misclassified by the clustering. Of course this involves making assumptions about the distribution of patterns from each generating process, but you can use datasets designed for supervised classification.
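A minimal sketch of that test (the two Gaussian generating processes and the midpoint-threshold "clustering" are illustrative stand-ins; any clustering algorithm could be slotted in): sample from known processes, cluster, and score the misclassification rate, minimised over label permutations since cluster ids are arbitrary.

```python
import random
from itertools import permutations

random.seed(0)

# Two known data-generating processes: 1-D Gaussians at 0 and 6.
truth, points = [], []
for cls, mu in enumerate([0.0, 6.0]):
    for _ in range(50):
        truth.append(cls)
        points.append(random.gauss(mu, 1.0))

# Stand-in clustering: split at the midpoint of the sample extremes.
cut = (min(points) + max(points)) / 2
assigned = [int(p > cut) for p in points]

def misclassification(truth, assigned, k=2):
    """Fraction misclassified, minimised over relabelings of the
    clusters (cluster ids carry no meaning on their own)."""
    best = len(truth)
    for perm in permutations(range(k)):
        best = min(best, sum(t != perm[a] for t, a in zip(truth, assigned)))
    return best / len(truth)
```

With generating processes this well separated, `misclassification(truth, assigned)` comes out at or near zero; as the processes are moved closer together, the rate rises, giving a controlled way to compare clustering methods.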
Others view clustering as attempting to group together points with similar attribute values, in which case measures such as SSE are applicable. However, I find this definition of clustering rather unsatisfactory, as it only tells you something about the particular sample of data, rather than something generalisable about the underlying distributions. How methods deal with overlapping clusters is a particular problem with this view (for the "data-generating process" view it causes no real problem; you just get probabilities of cluster membership).