Clustering – How to Derive a Distance Function Based on Multiple Variables for Cluster Analysis

clusteringdendrogramdistance-functionsmixed type data

I am not a statistician, so please excuse my lack of statistics knowledge/terminology.

I have a bunch of network nodes that I want to run cluster analysis on to identify clusters. As far as I understand, I can follow these steps to run a hierarchical agglomerative clustering (HAC):

  1. Identify variables
  2. Define a distance function
  3. Run the algorithm to join closer clusters and create one big cluster
  4. Cut the dendrogram tree at the height that makes meaningful clusters based on context
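
The steps above can be sketched with SciPy; this is a minimal, hypothetical example that assumes steps 1–2 already produced a pairwise distance matrix (the values below are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical output of step 2: a symmetric distance matrix for 4 nodes
D = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

# Step 3: agglomerate, here with average linkage, on the condensed matrix
Z = linkage(squareform(D), method="average")

# Step 4: cut the dendrogram at a chosen height to obtain cluster labels
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # nodes 0 and 1 end up in one cluster, nodes 2 and 3 in another
```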

My question is about the second step, though it is not yet clear to me how I will do the last step either.

The data I want to analyse is a set of computer network nodes that are down (not responding). I have the following information for each node:

  • location (longitude, latitude)
  • time the node went down
  • network provider

I believe this is the most relevant information to take into consideration for my clustering. Basically, I want to cluster together the nodes in a region that probably went down for the same reason.

For example, if a bunch of nodes went down at about the same time, are physically close to each other, and have the same provider, they probably fall into the same cluster.

Now the question is: how do I derive my distance function and include all these variables in it so that it makes sense? In other words, what is the mechanism for deriving a distance function based on multiple variables?

Also, as you can see, the variables are of different types. Should I handle this in the distance function by using Gower's coefficient of similarity? If so, how?
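
To make the question concrete, here is one Gower-style way to combine mixed variables into a single dissimilarity: range-normalise each scale variable, use simple matching (0/1) for the nominal one, and average. The field names, weights, and ranges below are hypothetical, just to illustrate the mechanism:

```python
def gower_distance(a, b, ranges):
    """Gower-style dissimilarity between two nodes.

    a, b   : dicts with scale keys 'lat', 'lon', 'time' and nominal key 'provider'
             (hypothetical field names for this sketch).
    ranges : observed range of each scale variable, used for normalisation.
    """
    d_lat = abs(a["lat"] - b["lat"]) / ranges["lat"]      # scale: normalised |diff|
    d_lon = abs(a["lon"] - b["lon"]) / ranges["lon"]
    d_time = abs(a["time"] - b["time"]) / ranges["time"]
    d_prov = 0.0 if a["provider"] == b["provider"] else 1.0  # nominal: simple matching
    # Unweighted average; weights could be introduced per variable
    return (d_lat + d_lon + d_time + d_prov) / 4.0

# Toy usage with made-up values
a = {"lat": 40.0, "lon": -74.0, "time": 100, "provider": "A"}
b = {"lat": 40.5, "lon": -74.5, "time": 160, "provider": "A"}
ranges = {"lat": 1.0, "lon": 1.0, "time": 120}
d = gower_distance(a, b, ranges)
print(d)  # → 0.375
```

Each per-variable distance lands in [0, 1], so no single variable dominates simply because of its units; whether to weight time more than provider is a modelling decision.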

Any examples, or suggestions on whether I am heading in the right direction, would be very helpful too.

Best Answer

Latitude and longitude are scale variables. Time is a scale variable too (I hope it is linear, not cyclic). Provider is a nominal variable. I see two options:

  • Use Two-step cluster analysis. This is the method of choice if you have many (thousands of) objects (nodes) to cluster. It has a nice option to detect outliers automatically; aside from this, it is quite a coarse method.
  • Use hierarchical cluster analysis based on the Gower coefficient (look here for links on how to compute it). This clustering is appropriate when the number of objects is, say, up to 500. You will have to choose among several agglomeration methods. Since the Gower coefficient is not Euclidean/metric, only the average, single, and complete linkage methods should be considered consistent (not Ward, centroid, or median). You will probably choose between average and complete (or try both), since single linkage produces overly elongated, chained clusters.

There exist, of course, other clustering methods that are potentially appropriate (for example, a modification of K-Means that can handle nominal variables), but I haven't used them, so I can't recommend one.

A good way to decide on the proper number of clusters is to use some internal clustering criterion, such as the silhouette statistic, cophenetic correlation, BIC, etc. (if you use SPSS, you can find macros to compute them on my web page). In clustering, you produce and save a range of cluster solutions (say, from a 20-cluster solution down to a 2-cluster solution) as cluster-membership variables, and then check with one or more clustering criteria which of the solutions yields the best-separated clusters: the ideal solution is one where density inside clusters is high and density between clusters is low.
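
As a sketch of that procedure, the snippet below scores a range of hierarchical solutions with the silhouette statistic and keeps the best one. It uses two made-up Gaussian blobs and plain Euclidean distances for simplicity; in the asker's setting, the precomputed matrix would come from a Gower-type dissimilarity instead:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two toy 2-D blobs standing in for node features (synthetic data)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Square pairwise distance matrix (Euclidean here; substitute Gower in practice)
M = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Z = linkage(squareform(M), method="average")

# Produce a range of solutions (2 to 7 clusters) and score each
scores = {}
for k in range(2, 8):
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = silhouette_score(M, labels, metric="precomputed")

best_k = max(scores, key=scores.get)
print(best_k)  # → 2 for these two well-separated blobs
```

The highest silhouette value marks the solution whose clusters are internally dense and mutually well separated, which matches the criterion described above.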