Solved – How to estimate the centroid of clustered sequences

clustering, r, sequence analysis, traminer

I have run a sequence analysis using the Optimal Matching algorithm. Afterwards, I clustered the resulting distance matrix using the Ward algorithm and calculated silhouettes, both as measures of cluster quality and to identify representative sequences.

Now, I am curious whether it is possible to estimate the sequences of the cluster centroids, which, to my knowledge, need not be original data points. How can I estimate the sequence of a centroid?

To get an idea of the different steps of the analysis, consider this manual example[1]:

library(TraMineR) 
library(WeightedCluster) 
data(mvad) 
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training") 
mvad.labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") 
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR") 

## Define sequence objects
mvad.seq <- seqdef(mvad[, 17:86], alphabet = mvad.alphabet, states = mvad.scodes, labels = mvad.labels, weights = mvad$weight, xtstep = 6)

## Computing dissimilarities (Hamming distance here; use method = "OM" for optimal matching)
mvad.dist <- seqdist(mvad.seq, method="HAM", sm="CONSTANT")

## Clustering
wardCluster <- hclust(as.dist(mvad.dist), method = "ward.D", members = mvad$weight)
clust4 <- cutree(wardCluster, k = 4)

## Silhouettes
sil <- wcSilhouetteObs(mvad.dist, clust4, weights = mvad$weight, measure = "ASWw")

## Sequence index plots ordered by representativeness
seqIplot(mvad.seq, group = clust4, sortv = sil)

In this example, it would be interesting to see whether the sequence of the third cluster's centroid differs from the most representative original sequences in the cluster, which are shown at the very top of the sequence index plot. In other cases, the centroid sequence may even have more of an ideal-type character: it does not exist in the original dataset but reflects certain typical structures.

[1] See for the example Studer, Matthias (2013). WeightedCluster Library Manual: A practical guide to creating typologies of trajectories in the social sciences with R. LIVES Working Papers, 24.

Best Answer

The cluster centroid, i.e., the theoretical center sequence that minimizes the sum of distances to all sequences in the cluster, is generally a virtual object that would be defined as a mix of states at each position (just as the average of integer values can be a non-integer).
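To picture what such a mix of states looks like, here is a minimal base R sketch on three made-up sequences (the toy data and the names seqs, states and mix are illustrative assumptions, not part of the mvad example). It computes the proportion of each state at each position, which is what averaging one-hot coded sequences would give. It only illustrates the idea; it is not a centroid returned by TraMineR.

```r
## Three toy sequences of length 3 over the states A, B, C (made-up data)
seqs <- matrix(c("A", "A", "B",
                 "A", "B", "B",
                 "A", "B", "C"), nrow = 3, byrow = TRUE)
states <- sort(unique(as.vector(seqs)))

## Proportion of each state at each position: rows = states, columns = positions
mix <- sapply(seq_len(ncol(seqs)), function(p) {
  tabulate(match(seqs[, p], states), nbins = length(states)) / nrow(seqs)
})
dimnames(mix) <- list(state = states,
                      position = paste0("t", seq_len(ncol(seqs))))
mix
```

Position t2 is one third A and two thirds B, so no observed sequence coincides with this virtual center. For a real sequence object, seqstatd from TraMineR reports comparable position-wise state distributions.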

TraMineR does not compute such virtual centers. However, it can compute the distance to the virtual center (for the formula used, see Studer, Ritschard, Gabadinho and Muller, 2011, Discrepancy analysis of state sequences, Sociological Methods and Research, Vol. 40(3), pp. 471-510).

The distance to the center is returned by the disscenter function. To get the distance to the center from the sequence with highest silhouette in each cluster, we first retrieve the indexes of those sequences.

## Index of the first sequence with maximum silhouette in each cluster
fclust <- factor(clust4)
levclust <- levels(fclust)
imax.sil <- rep(NA_integer_, length(levclust))
for (i in seq_along(levclust)) {
  in.clust <- fclust == levclust[i]
  max.sil <- max(sil[in.clust])
  imax.sil[i] <- which(sil == max.sil & in.clust)[1]
}
## computing distance to center
d.to.ctr <- disscenter(mvad.dist, group=fclust, 
           weights = mvad$weight)[imax.sil]
names(d.to.ctr) <- fclust[imax.sil]
d.to.ctr
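For intuition about what disscenter returns, the following base R sketch applies the unweighted form of the formula as I read it in the disscenter help page: the dissimilarity of x to the center is (sum of the dissimilarities from x to all cluster members, minus SS) / n, where SS is the sum of all pairwise dissimilarities divided by 2n. The helper name dist_to_center and the toy points are assumptions made for illustration; disscenter itself also handles weights and groups.

```r
## Unweighted distance to the virtual center from a dissimilarity matrix.
## dist_to_center is a hypothetical helper, not a TraMineR function.
dist_to_center <- function(D) {
  D <- as.matrix(D)
  n <- nrow(D)
  ss <- sum(D) / (2 * n)     # SS: sum over all ordered pairs divided by 2n
  (rowSums(D) - ss) / n      # (sum_i d(x, i) - SS) / n, for every x
}

## Toy check: points 0, 2, 4 on a line, using squared Euclidean distances
x <- c(0, 2, 4)
D <- as.matrix(dist(x))^2
d_center <- dist_to_center(D)
d_center
```

The center of the toy points is 2, and the result (4, 0, 4) equals the squared distances to it, obtained without ever constructing the center itself.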

Now, you may also consider comparing the sequence with the maximum silhouette value to the medoid, i.e., the sequence in the data with the smallest sum of distances to the other sequences in the cluster.
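As a rough illustration of that definition, the medoid of each cluster can be read directly off a distance matrix. The sketch below uses base R on made-up points and ignores weights; the names x, D, grp and medoid_index are assumptions for the example.

```r
## Six toy points in two clusters (made-up data)
x <- c(0, 1, 5, 10, 11, 12)
D <- as.matrix(dist(x))
grp <- factor(c(1, 1, 1, 2, 2, 2))

## For each cluster, pick the member with the smallest sum of
## distances to the other members of the same cluster
medoid_index <- sapply(levels(grp), function(g) {
  idx <- which(grp == g)
  idx[which.min(rowSums(D[idx, idx, drop = FALSE]))]
})
medoid_index
```

Here the medoids are observations 2 and 5. Unlike the virtual center, a medoid is always one of the observed cases.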

You get a plot of the medoid of each cluster with seqrplot:

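## criterion = "dist" selects representatives by centrality;
## diss replaces the deprecated dist.matrix argument in recent TraMineR versions
seqrplot(mvad.seq, group = fclust, diss = mvad.dist,
         criterion = "dist", nrep = 1)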

Alternatively, you can retrieve the index numbers of the medoids, and then print or plot the medoids as follows:

icenter <- disscenter(mvad.dist, group = clust4, 
            medoids.index="first", weights = mvad$weight)
print(mvad.seq[icenter,], format="SPS")
seqiplot(mvad.seq[icenter,])

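You could also compute the distances to the medoids by setting, for instance, refseq = icenter[1] in seqdist to get the distances to the medoid of the first cluster. Since mvad.dist already holds all pairwise distances, the same values are available as mvad.dist[, icenter[1]].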

Using the seqrep.grp function from the TraMineRextras package, you get representativeness quality measures of the medoids (see Gabadinho, Ritschard, Studer, Muller, 2011, "Extracting and Rendering Representative Sequences", In Fred, A., Dietz, J.L.G., Liu, K. & Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. Series: Communications in Computer and Information Science (CCIS). Volume 128, pp. 94-106. Springer-Verlag):

library(TraMineRextras)
seqrep.grp(mvad.seq, group = fclust, diss = mvad.dist,
           criterion = "dist", nrep = 1, ret = "both")

Hope this helps.
