I am not a statistician, so please excuse my lack of statistics knowledge/terminology.
I have a bunch of network nodes that I want to run cluster analysis on to identify clusters. As far as I understand, I can follow these steps to run hierarchical agglomerative clustering (HAC):
- Identify variables
- Define a distance function
- Run the algorithm, repeatedly joining the closest clusters until one big cluster remains
- Cut the dendrogram tree at the height that makes meaningful clusters based on context
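The four steps above can be sketched on toy data with SciPy's hierarchical clustering routines (the two-blob data and the cut height of 5.0 here are illustrative assumptions, not part of the question):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Step 1: variables -- here just two coordinates per node (two toy blobs).
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(10, 1, (5, 2))])

# Steps 2-3: Euclidean distance plus agglomerative joining ("average" linkage).
Z = linkage(X, method="average", metric="euclidean")

# Step 4: cut the dendrogram at a height chosen from context.
labels = fcluster(Z, t=5.0, criterion="distance")
print(labels)  # the two blobs come out as two clusters
```

The `linkage` matrix `Z` records every merge and its height, so the same tree can be re-cut at different heights without re-running the algorithm.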
My question concerns the second step, although it is also not yet clear to me how I will do the last step.
The data I want to analyse is a bunch of computer network nodes which are down (not responding). I have the following information for each node:
- location (longitude, latitude)
- time the node went down
- network provider
This is, I believe, the most relevant information to take into consideration for my clustering. Basically, I want to cluster the nodes in a region that probably went down for the same reason.
For example, if a bunch of nodes went down at about the same time, are physically close to each other, and have the same provider, they probably fall into the same cluster.
Now the question is: how do I derive my distance function so that it includes all of these variables in a way that makes sense? In other words, what is the mechanism for deriving a distance function from multiple variables?
Also, as you notice, the variables are of different types. Should I handle this in the distance function by using Gower's coefficient of similarity? How?
Any examples, or suggestions on whether I am on the right track, would be very helpful too.
Best Answer
Latitude and longitude are scale variables. Time is scale, too (I hope it is linear, not cyclic). Provider is a nominal variable. I see two options:

There exist, of course, other clustering methods that are potentially appropriate (for example, a modification of K-Means that can take nominal variables), but I haven't used them, so I can't recommend them.
A good way to decide on the proper number of clusters is to use some internal clustering criterion, such as the silhouette statistic, cophenetic correlation, BIC, etc. (if you use SPSS, you can find macros to compute them on my web page). In clustering, you produce and save a range of cluster solutions (say, from a 20-cluster solution down to a 2-cluster solution), which are variables of cluster membership, and then check with one or more clustering criteria which of the solutions represents the most well-separated clusters; the ideal solution is one where density inside clusters is high and density between them is low.
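The "range of solutions" idea above can be sketched as follows, using the silhouette statistic from scikit-learn as the internal criterion (the three-blob toy data and the k = 2..5 range are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three well-separated toy blobs of 8 points each.
X = np.vstack([rng.normal(c, 0.3, (8, 2)) for c in (0, 5, 10)])

D = pdist(X)                      # condensed pairwise distances
Z = linkage(D, method="average")  # one tree, cut many ways below

scores = {}
for k in range(2, 6):             # save a range of cluster solutions
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = silhouette_score(squareform(D), labels, metric="precomputed")

# Pick the cut whose clusters are densest inside and best separated.
best_k = max(scores, key=scores.get)
print(best_k)  # the three-blob data should peak at k = 3
```

Because the tree is built once and only re-cut for each k, scanning a wide range of solutions is cheap compared with re-running the clustering itself.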