Latitude and longitude are scale variables. Time is scale, too (I hope it is linear, not cyclic). Provider is a nominal variable. I see two options:
- Use Two-Step cluster analysis. This is the method of choice if you have many (thousands of) objects (nodes) to cluster. The method has a nice option to detect outliers automatically; aside from that, it is a rather coarse method.
- Use hierarchical cluster analysis based on the Gower coefficient (look here for links on how to compute it). This clustering is appropriate if the number of objects is, say, up to 500. You should choose among several agglomeration methods. Since the Gower coefficient is not Euclidean/metric, only the average, single, and complete linkage methods should be considered consistent (not Ward, centroid, or median). You will probably choose between average and complete (or try both), since single linkage produces overly oblong, chained clusters.
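As a sketch of the idea behind the Gower coefficient, here is a minimal, hypothetical `gower_distance` helper in Python: range-normalised absolute differences for the scale variables, simple matching for the nominal ones, averaged over all variables. In practice you would use a tested implementation rather than this toy loop.

```python
import numpy as np

def gower_distance(X_num, X_cat):
    """Gower dissimilarity for mixed data (illustrative sketch).

    X_num: (n, p) array of scale variables (e.g. latitude, longitude, time)
    X_cat: (n, q) array of nominal variables (e.g. provider codes)
    """
    n = X_num.shape[0]
    # Range-normalise each scale variable so its contribution lies in [0, 1]
    ranges = X_num.max(axis=0) - X_num.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid division by zero for constant variables
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num_part = np.abs(X_num[i] - X_num[j]) / ranges
            cat_part = (X_cat[i] != X_cat[j]).astype(float)  # simple matching
            d = (num_part.sum() + cat_part.sum()) / (X_num.shape[1] + X_cat.shape[1])
            D[i, j] = D[j, i] = d
    return D
```

The resulting matrix can then be fed directly to a hierarchical clustering routine that accepts precomputed distances.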
There exist, of course, other clustering methods that are potentially appropriate (for example, a modification of k-means that can take nominal variables), but I haven't used them, so I can't recommend one.
A good way to decide on the proper number of clusters is to use some internal clustering criterion, such as the Silhouette statistic, cophenetic correlation, BIC, etc. (if you use SPSS, you can find macros to compute them on my web-page). In clustering, you produce and save a range of cluster solutions (say, from a 20-cluster solution down to a 2-cluster solution) as cluster-membership variables, and then check with one or more clustering criteria which of the solutions represents the best-separated clusters; the ideal solution has high density within clusters and low density between them.
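The selection procedure above can be sketched in Python with scipy and scikit-learn. This toy example uses plain Euclidean distances in place of a Gower matrix (the mechanics are the same for any precomputed dissimilarity), cuts an average-linkage tree into 2 to 7 clusters, and keeps the solution with the highest Silhouette value; the data are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: two well-separated groups, so the criterion should pick 2 clusters
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
D_square = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
D_condensed = squareform(D_square, checks=False)  # condensed form for linkage()

Z = linkage(D_condensed, method="average")  # average linkage on precomputed distances
scores = {}
for k in range(2, 8):
    labels = fcluster(Z, t=k, criterion="maxclust")  # cut the tree into k clusters
    scores[k] = silhouette_score(D_square, labels, metric="precomputed")

best_k = max(scores, key=scores.get)  # solution with the best Silhouette value
```

With a real Gower matrix you would simply substitute it for `D_square` above.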
A somewhat complicated answer to your first question about distances goes like this.
Rencher's Multivariate Analysis cites Lance and Williams (1967), who proposed a general (flexible beta) method underlying most of the existing hierarchical cluster analyses: for three objects A, B, and C, whether clusters or individual points considered as clusters of size 1, if clusters A and B are joined into AB, then the new distance from AB to C can be expressed as
$$
d(AB,C) = \alpha_A d(C,A) + \alpha_B d(C,B) + \beta d(A,B) + \gamma| d(C,A) - d(C,B) |.
$$
They suggested $\alpha_A + \alpha_B + \beta=1$, $\alpha_A = \alpha_B$, $\gamma=0$ and $\beta < 1$, although you could try other choices. E.g., single linkage is $\alpha_A = \alpha_B = 1/2$, $\beta=0$, $\gamma=-1/2$ (contracting the space), complete linkage is the same except for $\gamma=1/2$ (diluting the space), and Ward's method is $\alpha_A = (n_A + n_C)/(n_A + n_B + n_C)$, $\alpha_B = (n_B + n_C)/(n_A + n_B + n_C)$, $\beta= - n_C/(n_A + n_B + n_C)$, and $\gamma=0$ (somewhat space-contracting). The book discusses the properties of the algorithms, such as monotonicity and space contraction/dilution as functions of these parameters.
So an algorithm can be built based on the distances alone, and does not need the data matrix. However, for typical problems (sample size $n$ > dimension of the data $p$) the distance matrix takes more memory than the data matrix, so this may not be a very efficient way of approaching the problem, unless of course the distances are all you have.
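To illustrate that the recurrence really is all an agglomerative algorithm needs, here is a minimal Python sketch that clusters from a distance matrix alone. `agglomerate` and the coefficient table are hypothetical names; the coefficients for single, complete, and average linkage follow the formula above.

```python
import numpy as np

# Lance-Williams coefficients (alpha_A, alpha_B, beta, gamma) as functions
# of the cluster sizes n_A, n_B, n_C
SCHEMES = {
    "single":   lambda nA, nB, nC: (0.5, 0.5, 0.0, -0.5),
    "complete": lambda nA, nB, nC: (0.5, 0.5, 0.0, 0.5),
    "average":  lambda nA, nB, nC: (nA / (nA + nB), nB / (nA + nB), 0.0, 0.0),
}

def agglomerate(D, scheme="average"):
    """Hierarchical clustering from a distance matrix alone (no data matrix).

    Returns the merge history as (cluster_i, cluster_j, height) triples.
    """
    D = D.astype(float)  # work on a copy; D must be symmetric with zero diagonal
    n = D.shape[0]
    sizes = {i: 1 for i in range(n)}
    active = set(range(n))
    merges = []
    coeffs = SCHEMES[scheme]
    while len(active) > 1:
        # find the closest pair among the active clusters
        h, a, b = min((D[i, j], i, j) for i in active for j in active if i < j)
        merges.append((a, b, h))
        nA, nB = sizes[a], sizes[b]
        # Lance-Williams update: distance from the merged cluster (kept at
        # index a) to every remaining cluster c
        for c in active - {a, b}:
            aA, aB, beta, gamma = coeffs(nA, nB, sizes[c])
            D[a, c] = D[c, a] = (aA * D[c, a] + aB * D[c, b]
                                 + beta * D[a, b]
                                 + gamma * abs(D[c, a] - D[c, b]))
        sizes[a] = nA + nB
        active.remove(b)
    return merges
```

For single linkage the update reduces to the minimum of the two old distances, and for complete linkage to the maximum, exactly as the textbook definitions require.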
You cannot use k-means then.
You don't only need a working distance function; you also need a way of computing means that is appropriate for that distance function.
The arithmetic mean and the squared Euclidean distance work together. Their combination is what makes k-means terminate: updating the means reduces the variance, and so does reassigning points, so the algorithm converges.
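This pairing can be checked empirically: the arithmetic mean is the unique minimiser of the within-cluster sum of squared Euclidean distances, so perturbing it can only increase the objective. A small Python sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(50, 2))  # a made-up "cluster" of points
mean = points.mean(axis=0)

def sse(center):
    """Within-cluster sum of squared Euclidean distances to a candidate center."""
    return ((points - center) ** 2).sum()

# Any perturbation of the arithmetic mean increases the sum of squares,
# which is why the k-means mean-update step can never increase the objective.
for _ in range(100):
    candidate = mean + rng.normal(scale=0.5, size=2)
    assert sse(mean) <= sse(candidate)
```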
However, what would be the mean of "american, canadian, canadian, chilean, chinese, chinese, american"?
Sorry, but k-means is only sensible for Euclidean vector spaces, where the distance and the mean play together well. And it has other limitations, e.g. the implicit assumptions that clusters are approximately equal in size and linearly separable.
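For completeness, the nominal-variable modification of k-means mentioned earlier (k-modes) sidesteps this by replacing the mean with the mode, which minimises the simple-matching dissimilarity instead. A tiny sketch using the example categories:

```python
from collections import Counter

# The nominal values from the question above
values = ["american", "canadian", "canadian", "chilean",
          "chinese", "chinese", "american"]

def mismatches(center, vals):
    """Simple-matching dissimilarity of a candidate center to the group."""
    return sum(v != center for v in vals)

# k-modes uses the mode as the cluster "center": no candidate category has
# fewer mismatches with the group than a most frequent one does.
mode = Counter(values).most_common(1)[0][0]
assert mismatches(mode, values) == min(mismatches(c, values) for c in set(values))
```

This is why a distance function alone is not enough: each distance needs a matching notion of "center" for a k-means-style algorithm to converge.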