Clustering – Comparing Partitional vs. Hierarchical Clustering with Distance Matrix

clusteringmachine learningr

I am exploring the flexibility of partitional clustering algorithms. In particular, I would like to introduce more general distances than the ones which are used by default.

Let us consider, for simplicity, dbscan contained in the R-package fpc. It allows the user to specify a "data matrix, data.frame, dissimilarity matrix or dist-object".

My idea would be to compute the distance matrix of the given data w.r.t. my chosen distance, and run dbscan.

Here comes the point where I am stuck. Is it true that specifying a distance matrix should lead inevitably to a hierarchical clustering? My intuition says that a hierarchical clustering in presence of a distance matrix makes more sense that a partitional one on the elements of the matrix itself. As I am no expert in clustering, I cannot judge the above statement properly.

Would you use a partitional algorithm on the distance matrix of a given dataset? Is this correct?

Thank you all!

Best Answer

You may want to read up on Generalized DBSCAN. DBSCAN itself only needs a binary decision.

So in fact, you can map the matrix to a binary matrix $x\ge\varepsilon$ and then set $\varepsilon=.5$ and you will get the exact same result with DBSCAN. It only needs a predicate of "is a neighbor", not an actual measure of neighborness.

There exist improved versions of DBSCAN for a long time, namely OPTICS, and just this year DBSCAN was revisited, incorporating ideas from OPTICS, as HDBSCAN*. Both of these actually produce a hierarchy.

Putting restrictions on the distance functions is mostly of interest for performance. Some distances can be accelerated with index structures, at which point these algorithm can run in less than $\mathcal O(n^2)$. Anything that is based on a distance matrix will obviously need at least $\mathcal O(n^2)$ memory and runtime.

The R options for clustering are in my opinion not very good. fpc is really slow. For clustering, I perfer ELKI. It is very flexible; if you are interested in evaluating the flexibility of clustering algorithms you should probably have a good look at it. It also has support for index structures.

Related Question