The cluster centroid, i.e., the theoretical true center sequence that minimizes the sum of distances to all sequences in the cluster, is generally a virtual object: it would be defined as a mix of states at each position, just as the average of integer values can take non-integer values.
TraMineR does not compute such virtual centers. However, it can compute the distance to the virtual center (for the formula used, see Studer, Ritschard, Gabadinho and Müller, 2011, "Discrepancy Analysis of State Sequences", Sociological Methods and Research, 40(3), pp. 471-510).
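In essence (a sketch of the idea in that paper, not necessarily its exact notation), the dissimilarity of a sequence to the virtual center of its group can be obtained from the pairwise dissimilarities alone. Treating the $d_{ij}$ as squared Euclidean distances, for a sequence $i$ in a group $g$ of size $n$,
$$d(i,g) = \frac{1}{n}\sum_{j \in g} d_{ij} - \frac{1}{2n^2}\sum_{j \in g}\sum_{k \in g} d_{jk}.$$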
The distance to the center is returned by the disscenter function. To get the distance to the center for the sequence with the highest silhouette in each cluster, we first retrieve the indexes of those sequences.
## Looking for the index of the first sequence with max
## silhouette in each cluster
fclust <- factor(clust4)
levclust <- levels(fclust)
imax.sil <- rep(NA, length(levclust))
for (i in 1:length(levclust)) {
    max.sil <- max(sil[fclust == levclust[i]])
    imax.sil[i] <- which(sil == max.sil & fclust == levclust[i])[1]
}
## Computing the distance to the center
d.to.ctr <- disscenter(mvad.dist, group = fclust,
                       weights = mvad$weight)[imax.sil]
names(d.to.ctr) <- fclust[imax.sil]
d.to.ctr
Now, you may also consider comparing the sequence with the maximum silhouette value to the medoid, i.e., the sequence in the data with the smallest sum of distances to the other sequences in the cluster.
You get a plot of the medoid of each cluster with seqrplot:
seqrplot(mvad.seq, group = fclust, dist.matrix = mvad.dist,
         criteria = "centrality", nrep = 1)
Alternatively, you can retrieve the index numbers of the medoids, and then print or plot the medoids as follows:
icenter <- disscenter(mvad.dist, group = clust4,
                      medoids.index = "first", weights = mvad$weight)
print(mvad.seq[icenter,], format="SPS")
seqiplot(mvad.seq[icenter,])
You could also compute the distances to the medoids by setting, for instance, refseq = icenter[1] in seqdist to get the distances to the medoid of the first cluster.
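For example, a minimal sketch, assuming mvad.dist was built with optimal matching and "TRATE" substitution costs (match these settings to whatever you actually used):
## Distances from every sequence to the medoid of the first cluster.
## The OM settings here are an assumption; use the same method, sm
## and indel values that produced mvad.dist.
d.to.med1 <- seqdist(mvad.seq, method = "OM", sm = "TRATE",
                     refseq = icenter[1])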
Using the seqrep.grp function from the TraMineRextras package, you get representativeness quality measures of the medoids (see Gabadinho, Ritschard, Studer and Müller, 2011, "Extracting and Rendering Representative Sequences", in Fred, A., Dietz, J.L.G., Liu, K. & Filipe, J. (eds), Knowledge Discovery, Knowledge Engineering and Knowledge Management, Communications in Computer and Information Science (CCIS), Vol. 128, pp. 94-106. Springer-Verlag).
library(TraMineRextras)
seqrep.grp(mvad.seq, group = fclust, mdis = mvad.dist,
           criteria = "centrality", nrep = 1, ret = "both")
Hope this helps.
As @Anony-Mousse implies, it isn't clear right now that your data actually are clusterable. In the end, you may choose to simply chop your data into partitions, if that will serve your business purposes, but there may not be any real latent groupings.
From where I sit, I cannot provide any guaranteed solutions, but perhaps I can offer some suggestions that may prove profitable:
- You have a single clear outlier (e.g., in the upper right corner of the [2,3] scatterplot) that will likely distort any analysis you try. You may want to investigate that store separately; in the interim, I would set that point aside.
- It isn't clear exactly how much data you have, but you state that you have "under 4k observations"; if it is close to that amount, say >3k, that is a lot. Since a good deal of exploratory data analysis will be necessary, I would randomly partition your data into two halves, explore the first half, and then validate your choices on the other half afterwards.
- I would experiment with various transformations of your variables to see if you can get better (i.e., more spherical) distributions. For example, taking the logarithm of your data may be appropriate. After finding a suitable transformation, check again for outliers.
- Then you will need to standardize each variable so that its mean is 0 and its standard deviation is 1. Be sure to keep the original mean and SD of each variable so that you can apply exactly the same transformation later when you work with the second set (see the sketch just after this list).
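A minimal sketch of these last two steps, assuming a hypothetical all-numeric data frame stores of per-store measurements (the name and the split are illustrative only):
## Log-transform (assumes strictly positive values), split in half,
## and z-standardize, keeping the training means and SDs so the
## exact same transformation can be applied to the hold-out half.
set.seed(1)
idx <- sample(nrow(stores), nrow(stores) %/% 2)
train <- log(stores[idx, ])
holdout <- log(stores[-idx, ])
mu <- colMeans(train)
sigma <- apply(train, 2, sd)
train.z <- scale(train, center = mu, scale = sigma)
hold.z <- scale(holdout, center = mu, scale = sigma)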
At this point (and only now), you can try clustering. I would not use k-means or k-medoids. Since you will have overlapping clusters, you will need a method that can handle that. The clustering algorithms I am familiar with that can do so are fuzzy k-means, Gaussian mixture modeling, and clustering by kernel density estimation.
- Fuzzy k-means is discussed on CV here; you can also try this search. To perform fuzzy k-means in R, you can use fanny from the cluster package (see ?fanny).
- Threads about Gaussian mixture modeling can be found on the site under the gaussian-mixture tag. Finite mixture modeling can be done in R with the mclust package. I have demonstrated GMM with mclust on CV here and here.
- Clustering by kernel density estimation is probably more esoteric. You can read the original paper¹ here. You can use kernel densities to cluster in R with the pdfCluster package. (All three approaches are sketched just below this list.)
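A sketch of the three approaches on the standardized training half (train.z from the sketch above); k = 3 and the default settings are placeholders, not recommendations:
library(cluster)     # fanny()
library(mclust)      # Mclust()
library(pdfCluster)  # pdfCluster()

fz <- fanny(train.z, k = 3)    # fuzzy k-means: soft memberships
gmm <- Mclust(train.z)         # GMM: model and G selected by BIC
kde <- pdfCluster(train.z)     # clusters = connected high-density regions

head(fz$membership)        # degrees of membership per cluster
table(gmm$classification)  # hard labels from the fitted mixture
table(kde@clusters)        # labels from the density-based grouping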
There is a continuity here: Fuzzy k-means essentially approximates GMM, but imposes sphericality on your clusters, which GMM does not do. GMMs make a very strong assumption that each cluster is multivariate normal (albeit possibly with different variances and covariances). If that isn't (nearly) perfectly true, the results can be distorted. Moreover, although kernel density estimates use a multivariate Gaussian kernel by default, the end result can be much more flexible and needn't yield multivariate normal clusters at all. This line of reasoning may suggest you simply go with the latter, but if the former constraints / assumptions hold they will benefit your analysis.
- You mention a variety of cluster validation metrics that you are using. Those are valuable, but I would select the method and the final clustering solution according to which possibility makes sense of the data given your knowledge of the topic and which provides actionable business intelligence. You should also try to visualize the clusters in various ways.
- Check your chosen strategy by performing the exact same preprocessing and clustering on the other half of your data, and see if you get similar and equally coherent / valuable results (one minimal check is sketched below).
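One minimal way to do that, reusing the hypothetical objects from the sketches above (hold.z, gmm); the agreement table is illustrative, not a formal test:
## Refit on the hold-out half with the same number of components,
## then cross-tabulate against the training model's predictions.
gmm.check <- Mclust(hold.z, G = gmm$G)
pred <- predict(gmm, newdata = hold.z)$classification
table(trained = pred, refitted = gmm.check$classification)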
1. Azzalini, A. & Torelli, N. (2007). Clustering via nonparametric density estimation. Statistics and Computing, 17(1), pp. 71-80.
Best Answer
The silhouette is computed for each observation $i$ as
$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$
where $a(i)$ is the average dissimilarity of $i$ to the members of the cluster to which $i$ belongs, and $b(i)$ is the lowest average dissimilarity of $i$ to the members of any other cluster.
The silhouette values of the members of a cluster $k$ are at the same positions as the values $k$ in the cluster membership vector cluster.object. So there is nothing more for you to do: your seqIplot command will automatically produce one index plot per cluster, with the sequences sorted by their silhouette values within each cluster. Sequences are sorted bottom-up, from the lowest to the highest silhouette value, meaning that the sequences with the best silhouette values in each cluster are at the top of the plots.
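For reference, a minimal sketch of how such silhouette values can be computed and passed to the plot, assuming mvad.dist is the full pairwise distance matrix and clust4 the integer membership vector:
## Per-observation silhouette widths, then one index plot per cluster
## with sequences sorted by their silhouette values.
library(cluster)
sil <- silhouette(clust4, dmatrix = mvad.dist)[, "sil_width"]
seqIplot(mvad.seq, group = clust4, sortv = sil)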
Hope this helps.