Solved – How to estimate the centroid of clustered sequences

Tags: clustering, r, sequence-analysis, traminer

I have run a sequence analysis using the Optimal Matching algorithm. Afterwards, I clustered the resulting distance matrix using the Ward algorithm and calculated silhouettes, both as measures of cluster quality and to identify representative sequences.

Now, I am curious whether it is possible to estimate the sequences of the cluster centroids which, to my knowledge, need not be original data points. How can I estimate the sequence of a centroid?

To get an idea of the different steps of the analysis, consider this example from the WeightedCluster manual[1]:

library(TraMineR) 
library(WeightedCluster) 
data(mvad) 
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training") 
mvad.labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") 
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR") 

## Define sequence objects
mvad.seq <- seqdef(mvad[, 17:86], alphabet = mvad.alphabet, states = mvad.scodes, labels = mvad.labels, weights = mvad$weight, xtstep = 6)

## Computing dissimilarities (the Hamming distance, method = "HAM", is used here)
mvad.dist <- seqdist(mvad.seq, method="HAM", sm="CONSTANT")

## Clustering ("ward.D" is the name used since R 3.1.0 for the former "ward")
wardCluster <- hclust(as.dist(mvad.dist), method = "ward.D", members = mvad$weight)
clust4 <- cutree(wardCluster, k = 4)

## Silhouettes
sil <- wcSilhouetteObs(mvad.dist, clust4, weights = mvad$weight, measure = "ASWw")

## Sequence index plots ordered by representativeness
seqIplot(mvad.seq, group = clust4, sortv = sil)

In this example, it would be interesting to see whether the sequence of the third cluster's centroid differs from the most representative original sequences in that cluster, which are printed at the very top of the sequence index plot. In other cases, the centroid sequence may even have more of an ideal-type character: a sequence that does not exist in the original dataset but reflects certain typical structures.

[1] For the example, see Studer, Matthias (2013). WeightedCluster Library Manual: A practical guide to creating typologies of trajectories in the social sciences with R. LIVES Working Papers, 24.

Best Answer

The cluster centroid, i.e., the theoretical true center sequence that minimizes the sum of distances to all sequences in the cluster, is generally a virtual object: it would be defined as a mix of states at each position (just as the average of integer values can be a non-integer).
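One way to picture such a "mix of states" is the position-wise state distribution of a cluster. The base R sketch below uses three made-up toy sequences (not the mvad data), and the names `toy` and `state_mix` are purely illustrative. Take it as an intuition aid only, not as the center implied by an arbitrary dissimilarity measure.

```r
## Toy cluster: three sequences of length 4 (rows = sequences, columns = positions)
toy <- rbind(c("EM", "EM", "FE", "FE"),
             c("EM", "FE", "FE", "FE"),
             c("EM", "EM", "EM", "FE"))
states <- c("EM", "FE")

## Share of each state at each position: a "virtual" sequence whose
## elements are mixes of states rather than single states
state_mix <- sapply(seq_len(ncol(toy)), function(p) {
  as.numeric(prop.table(table(factor(toy[, p], levels = states))))
})
rownames(state_mix) <- states
colnames(state_mix) <- paste0("t", seq_len(ncol(toy)))
round(state_mix, 2)
```

With a TraMineR sequence object, seqstatd applied to the sequences of one cluster should give the same kind of position-wise distribution.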

TraMineR does not compute such virtual centers. However, it can compute the distance to the virtual center (for the formula used, see Studer, Ritschard, Gabadinho and Muller, 2011, Discrepancy analysis of state sequences, Sociological Methods and Research, Vol. 40(3), pp. 471-510).
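To see why pairwise dissimilarities are enough for this, here is a base R sketch of the idea. It assumes the unweighted case and dissimilarities that behave like squared Euclidean distances; under that assumption, the squared distance from an object to the gravity center equals its mean dissimilarity to all cluster members minus half the mean of all pairwise dissimilarities. The function name `dist_to_center` and the toy points are illustrative only; the exact weighted formula is in the cited paper.

```r
## Distance to the (virtual) gravity center from pairwise dissimilarities only.
## d: symmetric matrix of dissimilarities within one cluster (unweighted case)
dist_to_center <- function(d) {
  rowMeans(d) - sum(d) / (2 * nrow(d)^2)
}

## Check on toy 2-D points, where the center can be computed explicitly
pts <- rbind(c(0, 0), c(2, 0), c(0, 2), c(4, 4))
d2  <- as.matrix(dist(pts))^2                 # squared Euclidean distances
ctr <- colMeans(pts)                          # explicit centroid
direct <- rowSums(sweep(pts, 2, ctr)^2)       # squared distance to the centroid

## Compare the two ways of getting the distance to the center
all.equal(unname(dist_to_center(d2)), direct)
```

The point of the check is that the center itself never has to be constructed, which is what makes the approach usable for sequences.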

The distance to the center is returned by the disscenter function. To get the distance to the center from the sequence with highest silhouette in each cluster, we first retrieve the indexes of those sequences.

## Looking for the index of the first sequence with max
## silhouette in each cluster
fclust <- factor(clust4)
levclust <- levels(factor(clust4))
imax.sil <- rep(NA,length(levclust))
for (i in 1:length(levclust)){
  max.sil <- max(sil[fclust==levclust[i]])
  imax.sil[i] <- 
    which(sil == max.sil & fclust == levclust[i])[1]
}
## computing distance to center
d.to.ctr <- disscenter(mvad.dist, group=fclust, 
           weights = mvad$weight)[imax.sil]
names(d.to.ctr) <- fclust[imax.sil]
d.to.ctr

Now, you may also consider comparing the sequence with the maximum silhouette value to the medoid, i.e., the sequence in the data with the smallest sum of distances to the other sequences in the cluster.
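The medoid definition translates directly into a few lines of base R. This is a minimal unweighted sketch on made-up numbers; `medoid_index` is a hypothetical helper, not a TraMineR function, and ties are resolved by taking the first candidate.

```r
## Index (within the full data) of the medoid of each cluster, computed
## from a full dissimilarity matrix d and a membership vector cl
medoid_index <- function(d, cl) {
  sapply(split(seq_along(cl), cl), function(idx) {
    idx[which.min(rowSums(d[idx, idx, drop = FALSE]))]
  })
}

## Toy example: two well-separated groups on a line
x   <- c(1, 2, 4, 10, 11, 12)
d   <- as.matrix(dist(x))
cl  <- c(1, 1, 1, 2, 2, 2)
med <- medoid_index(d, cl)
med            # position of the medoid of each cluster
d[, med[1]]    # distances of every observation to the medoid of cluster 1
```

Since the medoid is an actual observation, its column of the dissimilarity matrix already holds the distances to it.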

You get a plot of the medoid of each cluster with seqrplot:

seqrplot(mvad.seq, group = fclust, dist.matrix = mvad.dist,
         criteria = "centrality", nrep=1)

Alternatively, you can retrieve the index numbers of the medoids, and then print or plot the medoids as follows:

icenter <- disscenter(mvad.dist, group = clust4, 
            medoids.index="first", weights = mvad$weight)
print(mvad.seq[icenter,], format="SPS")
seqiplot(mvad.seq[icenter,])

You could also compute the distances to a medoid, for instance by setting refseq = icenter[1] in seqdist to get the distances to the medoid of the first cluster.

Using the seqrep.grp function from the TraMineRextras package, you get representativeness quality measures for the medoids (see Gabadinho, Ritschard, Studer, Muller, 2011, "Extracting and Rendering Representative Sequences", In Fred, A., Dietz, J.L.G., Liu, K. & Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. Series: Communications in Computer and Information Science (CCIS). Volume 128, pp. 94-106. Springer-Verlag):

library(TraMineRextras)
seqrep.grp(mvad.seq, group = fclust, mdis = mvad.dist, 
           criteria = "centrality", nrep=1, ret = "both")

Hope this helps.

Related Solutions

Solved – Index plot for each cluster sorted by the silhouette

The silhouette is computed for each observation $i$ as

$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$

where $a(i)$ is the average dissimilarity with members of the cluster to which $i$ belongs, and $b(i)$ the minimum average dissimilarity to members of another cluster.
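As a rough illustration of this formula, here is an unweighted base R sketch on made-up numbers. The function name `silhouette_obs` is hypothetical; it excludes $i$ itself from the average $a(i)$ and sets the silhouette of a singleton cluster to 0, both of which are conventions I am assuming here. It is not a substitute for wcSilhouetteObs, which also handles weights.

```r
## Silhouette of each observation from a dissimilarity matrix d and
## a cluster membership vector cl (unweighted, at least two clusters)
silhouette_obs <- function(d, cl) {
  sapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE                      # exclude i itself from a(i)
    if (!any(own)) return(0)             # singleton cluster: s(i) set to 0
    other <- cl != cl[i]
    a <- mean(d[i, own])                 # average dissimilarity to own cluster
    b <- min(tapply(d[i, other], cl[other], mean))  # closest other cluster
    (b - a) / max(a, b)
  })
}

## Toy example: two groups on a line
x  <- c(1, 2, 4, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
s  <- silhouette_obs(as.matrix(dist(x)), cl)
round(s, 3)
```

Observations close to the other group (here the value 4) get a lower silhouette than those in the core of their own group.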

The silhouette values of members of a cluster $k$ are at the same positions as the values $k$ in the cluster membership vector cluster.object, so there is nothing further to do. Your seqIplot command will automatically produce one index plot for each cluster, with the sequences sorted by their silhouette values within each cluster.

Sequences will be sorted bottom-up from the lowest to the highest silhouette value, meaning that the sequences with the best silhouette values for each cluster are at the top of the plots.

Hope this helps.

Solved – Entropy of state distribution, complexity index and turbulence for sequences of varying length

There are different ways of dealing with sequences of different length.

In TraMineR by setting right="DEL" in seqdef, the missing positions after the last valid state are set as void elements and will be ignored by all functions even when used with with.missing=TRUE.

With right=NA they are considered as missing values up to the length of the longest sequence. In that case, the effect of with.missing=TRUE is to turn the NA state into an additional explicit state of the alphabet. Results correspond to what we would obtain by filling the sequences with a designated symbol that would be added to the alphabet.

Therefore, assuming there is no missing element before the last valid state in each sequence, cross-sectional as well as longitudinal non-normalized entropies will be the same whether with.missing is set as TRUE or FALSE. Normalized values will change, however, since setting with.missing=TRUE increases the alphabet size by one unit.

To illustrate, let us consider four sequences of varying length and the same sequences filled with m tokens to make them of the same length.

library(TraMineR)

x1 <- "a-b-b-c"
x2 <- "a-a-b-b-b-b-c-c"
x3 <- "a-b"
x4 <- "a-a-b-b-c-c"

y1 <- "a-b-b-c-m-m-m-m"
y2 <- "a-a-b-b-b-b-c-c"
y3 <- "a-b-m-m-m-m-m-m"
y4 <- "a-a-b-b-c-c-m-m"

seqt    <- seqdef(c(x1,x2,x3,x4), right="DEL")
seqt.na <- seqdef(c(x1,x2,x3,x4), right=NA)
seqt.mm <- seqdef(c(y1,y2,y3,y4), right="DEL")

Now we consider five possibilities. The cross-sectional distributions can be plotted with

par(mfrow=c(2,3))
seqdplot(seqt, with.missing=FALSE, withlegend=FALSE)
seqdplot(seqt.na, with.missing=FALSE, withlegend=FALSE)
seqdplot(seqt, with.missing=TRUE, withlegend=FALSE)
seqdplot(seqt.na, with.missing=TRUE, withlegend=FALSE)
seqdplot(seqt.mm, with.missing=TRUE, withlegend=FALSE)
seqlegend(seqt.mm)

and the transversal entropies for each situation are obtained as

te      <- seqstatd(seqt, with.missing=FALSE)$Entropy
te.na   <- seqstatd(seqt.na, with.missing=FALSE)$Entropy
te.T    <- seqstatd(seqt, with.missing=TRUE)$Entropy
te.na.T <- seqstatd(seqt.na, with.missing=TRUE)$Entropy
te.mm   <- seqstatd(seqt.mm)$Entropy
te.tab <- data.frame(te, te.na, te.T, te.na.T, te.mm)
round(te.tab, 3)

##       te te.na  te.T te.na.T te.mm
## [1] 0.000 0.000 0.000   0.000 0.000
## [2] 0.631 0.631 0.500   0.500 0.500
## [3] 0.000 0.000 0.000   0.406 0.406
## [4] 0.579 0.579 0.459   0.750 0.750
## [5] 0.631 0.631 0.500   0.750 0.750
## [6] 0.631 0.631 0.500   0.750 0.750
## [7] 0.000 0.000 0.000   0.406 0.406
## [8] 0.000 0.000 0.000   0.406 0.406

We observe that with with.missing=FALSE the computed entropy is the same whatever the value of the right attribute (first 2 columns). For with.missing=TRUE, the results differ. The difference between te.T and the first two columns is due to the normalizing factor, i.e., the entropy of the alphabet which has one more token (the missing token) when we set with.missing=TRUE.
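The effect of the normalizing factor can be checked by hand for position 2, where the four sequences show the states b, a, b, a. The sketch below is plain base R; `norm_entropy` is an illustrative helper, and I am assuming the normalization divides the Shannon entropy by the log of the alphabet size, which is consistent with the printed values.

```r
## Shannon entropy of a state distribution, normalized by the entropy
## of a uniform distribution over an alphabet of the given size
norm_entropy <- function(p, alphabet_size) {
  p <- p[p > 0]                       # 0 * log(0) is treated as 0
  -sum(p * log(p)) / log(alphabet_size)
}

## Position 2 of the example: a and b each with share 1/2, no c
p2 <- c(a = 0.5, b = 0.5, c = 0)
round(norm_entropy(p2, 3), 3)   # alphabet {a, b, c}: 0.631, as in column te
round(norm_entropy(p2, 4), 3)   # alphabet plus the missing state: 0.5, as in te.T
```

The distribution is the same in both calls; only the divisor changes from log(3) to log(4).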

Similar results hold for longitudinal entropies returned by the seqient TraMineR function.

The longitudinal entropy depends on the state distribution only, not on the sequence length. E.g., the first two sequences have the same longitudinal distribution and we get:

seqient(seqt)[1:2]

## [1] 0.9463946 0.9463946
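The same value can be reproduced without TraMineR. The following base R sketch assumes the longitudinal entropy is the Shannon entropy of the within-sequence state shares divided by the log of the alphabet size (3 here); `long_entropy` is an illustrative name.

```r
## Normalized longitudinal entropy of one sequence given as a "-"-separated string
long_entropy <- function(s, alphabet_size) {
  states <- strsplit(s, "-", fixed = TRUE)[[1]]
  p <- as.numeric(prop.table(table(states)))   # within-sequence state shares
  -sum(p * log(p)) / log(alphabet_size)
}

## Both sequences spend 1/4 of the time in a, 1/2 in b and 1/4 in c
long_entropy("a-b-b-c", 3)
long_entropy("a-a-b-b-b-b-c-c", 3)
```

Doubling every spell leaves the shares unchanged, hence the identical entropies.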

The turbulence depends on the length of the sequence. It is defined by Elzinga and Liefbroer (2007) as the log (in base 2) of the product of the number of subsequences of the DSS (sequence of distinct successive states) and the inverse of the normalized variance of the time spent in the states present in the sequence. This latter normalized variance is obtained by dividing the variance by the maximum possible variance, and it is this maximum that depends on the sequence length.

seqST(seqt)[1:2]

## [1] 3.00000 4.79518
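For readers who want to see where the length enters, here is a base R sketch that reproduces the two values above. It rests on assumptions of mine: the formula log2(phi * (vmax + 1) / (v + 1)), with phi the number of distinct subsequences of the DSS (empty one included), v the population variance of the spell durations, and vmax = (n - 1)(1 - mean duration)^2 for n spells. The helper names are illustrative.

```r
## Number of distinct subsequences (empty one included) of a vector of states
count_subseq <- function(x) {
  total <- 1
  last  <- list()                 # count reached just before the previous occurrence
  for (s in x) {
    before <- total
    dup    <- if (is.null(last[[s]])) 0 else last[[s]]
    total  <- 2 * total - dup     # avoid counting duplicates twice
    last[[s]] <- before
  }
  total
}

turbulence <- function(s) {
  r    <- rle(strsplit(s, "-", fixed = TRUE)[[1]])   # spells: DSS and durations
  dur  <- r$lengths
  v    <- mean((dur - mean(dur))^2)                  # population variance (assumed)
  vmax <- (length(dur) - 1) * (1 - mean(dur))^2      # maximum variance (assumed)
  log2(count_subseq(r$values) * (vmax + 1) / (v + 1))
}

turbulence("a-b-b-c")
turbulence("a-a-b-b-b-b-c-c")
```

Both sequences share the DSS a-b-c, so the difference between them comes entirely from the duration term.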

Likewise, the complexity index also depends on the sequence length. This index is defined (A. Gabadinho et al., 2011) as the geometric mean between the normalized entropy and the length of the DSS normalized by the length of the sequence. Thus, the sequence length affects the index through this latter normalization.

seqici(seqt)[1:2]

## [1] 0.7943109 0.5199985
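Here too the printed values can be reproduced in base R. The sketch assumes the index is the square root of the product of the normalized entropy and the number of transitions (DSS length minus one) divided by the sequence length minus one; that normalization is my assumption, chosen because it matches the output above. `complexity` is an illustrative name.

```r
## Complexity index of one sequence given as a "-"-separated string
complexity <- function(s, alphabet_size) {
  states <- strsplit(s, "-", fixed = TRUE)[[1]]
  p      <- as.numeric(prop.table(table(states)))
  h      <- -sum(p * log(p)) / log(alphabet_size)   # normalized entropy
  trans  <- length(rle(states)$lengths) - 1         # number of transitions
  sqrt(h * trans / (length(states) - 1))            # geometric mean of the two parts
}

complexity("a-b-b-c", 3)
complexity("a-a-b-b-b-b-c-c", 3)
```

The entropy part is identical for both sequences; the second one scores lower because its two transitions are spread over seven possible transition points instead of three.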

The values returned by seqient and seqici will change slightly when used with the with.missing=TRUE argument because of its effect on the entropy normalization factor.
