Clustering – How to Perform Clustering with a Distance Matrix


I have a (symmetric) matrix M that represents the distance between each pair of nodes. For example,

    A   B   C   D   E   F   G   H   I   J   K   L
A   0  20  20  20  40  60  60  60 100 120 120 120
B  20   0  20  20  60  80  80  80 120 140 140 140
C  20  20   0  20  60  80  80  80 120 140 140 140
D  20  20  20   0  60  80  80  80 120 140 140 140
E  40  60  60  60   0  20  20  20  60  80  80  80
F  60  80  80  80  20   0  20  20  40  60  60  60
G  60  80  80  80  20  20   0  20  60  80  80  80
H  60  80  80  80  20  20  20   0  60  80  80  80
I 100 120 120 120  60  40  60  60   0  20  20  20
J 120 140 140 140  80  60  80  80  20   0  20  20
K 120 140 140 140  80  60  80  80  20  20   0  20
L 120 140 140 140  80  60  80  80  20  20  20   0

Is there any method to extract clusters from M (fixing the number of clusters if needed), such that each cluster contains nodes with small distances between them? In this example, the clusters would be (A, B, C, D), (E, F, G, H) and (I, J, K, L).

I've already tried UPGMA and k-means, but the resulting clusters are very poor.

The distances are the average number of steps a random walker takes to go from node A to node B (B ≠ A) and back to node A. It's guaranteed that M^(1/2) is a metric. To run k-means, I don't use centroids: I define the distance between a node n and a cluster c as the average distance between n and all nodes in c.
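In R, that assignment step looks roughly like this (M is just the matrix above; reassign and labels are illustrative names I made up):

    labs <- LETTERS[1:12]
    M <- matrix(c(
        0,  20,  20,  20,  40,  60,  60,  60, 100, 120, 120, 120,
       20,   0,  20,  20,  60,  80,  80,  80, 120, 140, 140, 140,
       20,  20,   0,  20,  60,  80,  80,  80, 120, 140, 140, 140,
       20,  20,  20,   0,  60,  80,  80,  80, 120, 140, 140, 140,
       40,  60,  60,  60,   0,  20,  20,  20,  60,  80,  80,  80,
       60,  80,  80,  80,  20,   0,  20,  20,  40,  60,  60,  60,
       60,  80,  80,  80,  20,  20,   0,  20,  60,  80,  80,  80,
       60,  80,  80,  80,  20,  20,  20,   0,  60,  80,  80,  80,
      100, 120, 120, 120,  60,  40,  60,  60,   0,  20,  20,  20,
      120, 140, 140, 140,  80,  60,  80,  80,  20,   0,  20,  20,
      120, 140, 140, 140,  80,  60,  80,  80,  20,  20,   0,  20,
      120, 140, 140, 140,  80,  60,  80,  80,  20,  20,  20,   0
    ), nrow = 12, byrow = TRUE, dimnames = list(labs, labs))

    # One assignment pass: each node moves to the cluster whose
    # members it is closest to on average.
    reassign <- function(M, labels) {
      sapply(rownames(M), function(n) {
        avg <- tapply(M[n, ], labels, mean)  # mean distance from n to each cluster
        names(avg)[which.min(avg)]
      })
    }

    # e.g. one pass from a random initial labelling:
    reassign(M, sample(3, 12, replace = TRUE))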

Thanks a lot 🙂

Best Answer

There are a number of options.

k-medoids clustering

First, you could try partitioning around medoids (PAM) instead of k-means. PAM is more robust to noise and outliers, and, unlike k-means, it works directly on a dissimilarity matrix like yours, so it could give better results. Van der Laan reworked the algorithm; if you're going to implement it yourself, his article is worth a read.

There is also a specific k-medoids algorithm for large datasets, called CLARA (clara in R). It is described in chapter 3 of Finding Groups in Data: An Introduction to Cluster Analysis by Kaufman, L. and Rousseeuw, P.J. (1990).
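To illustrate, a minimal sketch with pam from the cluster package, run directly on the distance matrix (M as built in the sketch in the question; k = 3 is the number of clusters you'd fix):

    library(cluster)

    # M: the 12x12 distance matrix built in the earlier sketch.
    fit <- pam(as.dist(M), k = 3)
    fit$clustering  # cluster membership for A..L
    fit$medoids     # the representative node of each cluster

Note that pam accepts a dissimilarity object directly, so no coordinates or centroids are needed. clara, on the other hand, needs a raw data matrix, so it doesn't apply when all you have is M.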

hierarchical clustering

Instead of UPGMA, you could try some other hierarchical clustering options. First of all, when you use hierarchical clustering, be sure you define the linkage method properly. This linkage method is essentially how the distances between observations and clusters are calculated. I mostly use Ward's method or complete linkage, but other options might be the right choice for you.

I don't know whether you've tried them yet, but single linkage and neighbour joining are often preferred over UPGMA in phylogenetic applications. If you haven't tried them, give them a shot as well; they often give remarkably good results.
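For reference, a minimal sketch of these linkage options with hclust, cutting each tree into three clusters (M as built earlier; ward.D2 implements Ward's criterion on a distance matrix):

    # M: the 12x12 distance matrix built in the earlier sketch.
    d <- as.dist(M)
    for (m in c("ward.D2", "complete", "single")) {
      hc <- hclust(d, method = m)
      print(cutree(hc, k = 3))  # cut the dendrogram into 3 clusters
    }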


In R you can take a look at the cluster package; all the algorithms described above are implemented there. See ?pam, ?clara, ?hclust, ... Also check the different implementations of the algorithm in ?kmeans. Sometimes choosing another algorithm can improve the clustering substantially.
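One caveat: kmeans itself wants raw coordinates, not a distance matrix. A possible workaround, purely a sketch on my part, is to embed the distances with classical multidimensional scaling first and then cluster the embedding:

    # M: the 12x12 distance matrix built in the earlier sketch.
    # cmdscale() embeds the distances into Euclidean coordinates,
    # which kmeans() can then cluster; the embedding is an approximation.
    coords <- cmdscale(as.dist(M), k = 2)
    km <- kmeans(coords, centers = 3, nstart = 25)
    km$cluster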


EDIT: Just thought about something. If you work with graphs, nodes and the like, you should also take a look at the Markov clustering (MCL) algorithm. It is used, for example, to group sequences based on BLAST similarities, and it performs incredibly well. It can do the clustering for you, or give you some ideas on how to solve the research problem you're focusing on. Without knowing anything about your problem, in fact, I'd guess its results are definitely worth looking at. If I may say so, I still consider this method of Stijn van Dongen one of the nicest results in clustering I've ever encountered.

http://www.micans.org/mcl/
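For a feel of how it works, here is a bare-bones sketch of the MCL iteration in base R (the real implementation at the link above is far more refined; the conversion from distances to similarities, 1 / (1 + M), is my own arbitrary choice, and whether exactly three groups emerge depends on the inflation parameter):

    # M: the 12x12 distance matrix built in the earlier sketch.
    # MCL operates on edge weights (similarities), not distances.
    sim <- 1 / (1 + M)

    mcl_sketch <- function(sim, inflation = 2, max_iter = 100, tol = 1e-9) {
      P <- sweep(sim, 2, colSums(sim), "/")  # column-stochastic flow matrix
      for (i in seq_len(max_iter)) {
        Q <- P %*% P                         # expansion: flow along longer paths
        Q <- Q ^ inflation                   # inflation: boost strong flows
        Q <- sweep(Q, 2, colSums(Q), "/")    # renormalize the columns
        if (max(abs(Q - P)) < tol) { P <- Q; break }
        P <- Q
      }
      # Nodes whose columns put their mass on the same attractor row
      # end up in the same cluster.
      split(colnames(sim), apply(P, 2, which.max))
    }

    mcl_sketch(sim)

Raising the inflation parameter gives finer clusters; lowering it gives coarser ones.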