Solved – how to perform divisive hierarchical clustering

clusteringdata miningdatasethierarchical clustering

I've been trying for a long time to figure out how to perform (on paper) the divisive hierarchical clustering algorithem, however I'm not able to understand how to do it exactly.
example:

divisive hierarchical clustering

I need to do it using Manhattan distance.

According to what I understand, I need to calculate the distances matrix (using Manhattan distance). then put all the observation in one cluster (let's call it A) and have another cluster, which at the beginning will be empty, B.

Then I need to calculate: a(1)=[d1,2+d1,3+d1,4+d1,5]/4 for all i in 1,…,5
and calculate a(i)-d(i,B) and choose the observation the gives me the maximal a(i)-d(i,B) and move it to B.

Do I need to calculate again the distances matrix?
How do I calculate d(i,B) when be contains more then one observation?

thank you!!

Best Answer

This question is rather old, but I think an answer may help some people.

I understand that you mean by this 5 points in three dimensions. You say "the divisive hierarchical clustering algorithm". I am going to assume that you want the DIANA algorithm (Kaufman, L.; Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis).

As you said, we start with all points in a single cluster. At each step, we will split the largest cluster (the one with the largest diameter) into two pieces. To divide the data, we start by finding the single point which is the most dissimilar to the rest of the cluster. Then we successively move points to the new cluster if they are more similar to the new cluster than the old one As you noted in your question, that requires that we have a method for computing the distance from a point to a group of points. The method used in DIANA is the average of all the distances from the point to the points in the group. Throughout the process, we need the distance matrix.

  1 2 3 4 5
1 0 4 4 5 7
2 4 0 4 3 3
3 4 4 0 5 7
4 5 3 5 0 2
5 7 3 7 2 0

The top level cluster is all points {1,2,3,4,5}.

Which point is most dissimilar? As you noted, we take the average distance to the other points. So, just for example, for point 1 we compute

[ d(1,2) + d(1,3) + d(1,4) + d(1,5) ] / 4
= (4 + 4 + 5 + 7) / 4 = 5

The average distances for all points are:

   1    2    3    4    5 
5.00 3.50 5.00 3.75 4.75

Since points 1 and 3 are tied for the most dissimilar, we pick one of these arbitrarily. I will use point 1. Using your notation, we have A = {2,3,4,5} and B = {1}

Now we want to move any points that are closer to B than (the other points in) A into B. So for each point x in A we compute d(x, A-x) and d(x,B). For example, for point 2 we compute

[ d(2,3) + d(2,4) + d(2,5) ] / 3 - d(2,1)
= 10/3 - 4 = -2/3

All of the differences are:

 2      3       4      5
-2/3   4/3    -5/3    -3

Only point 3 is bigger than zero so we move it to cluster B. We now have A = {2,4,5} B = {1,3}

We check if any additional points should be moved. Again, we compute d(x, A-x) - d(x,B) for each point in A. The differences are:

   2    4    5 
-1.0 -2.5 -4.5 

All are negative (that is the remaining points in A are closer to A than to B), so we stop this division and we have the two clusters {2,4,5} and {1,3}.

For the next step, we choose the cluster with the largest diameter, that is the cluster with the greatest distance between two points in the cluster.

diameter({1,3})   = d(1,3) = 4
diameter({2,4,5}) = max(d(2,4), d(2,5), d(4,5))  
                  = max(3, 3, 2) = 3

So cluster {1,3} has the largest diameter. Trivially, this will be split into {1} and {3}. So now we have clusters {2,4,5}, {1} and {3}.

At the next step, we must split the cluster {2,4,5}. As above, for each point we will find the point with the largest average distance to the rest of the points. For example we compute

point 2: [d(2,4) + d(2,5)] / 2 = [3+3]/2 = 3

The other averages are point 4: 5/2 and point 5: 5/2, so point 2 is the most dissimilar. We split {2,4,5} into A={4,5} and B={2}. We need to check if some additional points should be moved into the set B.

d(4,5) - d(4,2} = 2-3 = -1
d(5,4) - d(5,2} = 2-3 = -1

So no additional points should be moved. We split {2,4,5} into {2} and {4,5} giving us the partition of the full data set {1}, {2}, {3}, {4,5}

Finally, we divide the last cluster to get {1} {2} {3} {4} {5} Thus, the full hierarchy looks like this:

dendrogram