The silhouette is computed for each observation $i$ as
$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$
where $a(i)$ is the average dissimilarity with members of the cluster to which $i$ belongs, and $b(i)$ the minimum average dissimilarity to members of another cluster.
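To make the formula concrete, here is a minimal base R sketch that computes $s(i)$ by hand. The dissimilarity matrix d, the membership vector cl and the helper sil_one are hypothetical, made up for illustration only; singleton clusters are not handled.

```r
# Toy dissimilarity matrix for 4 observations (made-up values)
d <- matrix(c(0, 1, 4, 5,
              1, 0, 3, 4,
              4, 3, 0, 1,
              5, 4, 1, 0), nrow = 4, byrow = TRUE)
cl <- c(1, 1, 2, 2)  # cluster membership vector

# Silhouette of observation i (hypothetical helper, not a TraMineR function)
sil_one <- function(i, d, cl) {
  own <- setdiff(which(cl == cl[i]), i)          # other members of i's cluster
  a <- mean(d[i, own])                           # a(i): average within-cluster dissimilarity
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # b(i): nearest other cluster
  (b - a) / max(a, b)
}

s <- sapply(seq_along(cl), sil_one, d = d, cl = cl)
round(s, 3)
## [1] 0.778 0.714 0.714 0.778
```

For the first observation, $a(1) = 1$ and $b(1) = (4 + 5)/2 = 4.5$, so $s(1) = 3.5/4.5 \approx 0.778$.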
The silhouette values of the members of a cluster $k$ are at the same positions as the values $k$ in the cluster membership vector cluster.object, so you do not have anything to do.
Your seqIplot command will automatically produce one index plot for each cluster, with the sequences sorted by their silhouette values within each cluster.
Sequences are sorted bottom-up from the lowest to the highest silhouette value, meaning that the sequences with the best silhouette values in each cluster are at the top of the plots.
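For readers without the original command at hand, the following is only a sketch of what such a call could look like. It assumes the mvad example data shipped with TraMineR, LCS distances, a Ward clustering into three groups, and the silhouette function of the cluster package; none of these choices come from the question, and the object names seqs, d and sil are made up.

```r
library(TraMineR)
library(cluster)

# Assumed setup: first 100 sequences of the mvad example data
data(mvad)
seqs <- seqdef(mvad[1:100, 17:86])
d <- seqdist(seqs, method = "LCS")

# Assumed clustering: Ward hierarchical clustering cut into 3 groups
cluster.object <- cutree(hclust(as.dist(d), method = "ward.D"), k = 3)

# Silhouette values, in the same order as the sequences
sil <- silhouette(cluster.object, dmatrix = d)[, "sil_width"]

# One index plot per cluster, sequences sorted by silhouette within each cluster
seqIplot(seqs, group = cluster.object, sortv = sil)
```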
Hope this helps.
There are different ways of dealing with sequences of different length.
In TraMineR, by setting right="DEL" in seqdef, the missing positions after the last valid state are set as void elements and will be ignored by all functions, even when used with with.missing=TRUE.
With right=NA, they are considered as missing values up to the length of the longest sequence. In that case, the effect of with.missing=TRUE is to turn the NA state into an additional explicit state of the alphabet. Results correspond to what we would obtain by filling the sequences with a designated symbol that would be added to the alphabet.
Therefore, assuming there is no missing element before the last valid state in each sequence, cross-sectional as well as longitudinal non-normalized entropies will be the same whether with.missing is set to TRUE or FALSE. Normalized values will change, however, since setting with.missing=TRUE increases the alphabet size by one unit.
To illustrate, let us consider four sequences of varying length and the same sequences filled with m tokens to make them of the same length.
library(TraMineR)
x1 <- "a-b-b-c"
x2 <- "a-a-b-b-b-b-c-c"
x3 <- "a-b"
x4 <- "a-a-b-b-c-c"
y1 <- "a-b-b-c-m-m-m-m"
y2 <- "a-a-b-b-b-b-c-c"
y3 <- "a-b-m-m-m-m-m-m"
y4 <- "a-a-b-b-c-c-m-m"
seqt <- seqdef(c(x1,x2,x3,x4), right="DEL")
seqt.na <- seqdef(c(x1,x2,x3,x4), right=NA)
seqt.mm <- seqdef(c(y1,y2,y3,y4), right="DEL")
Now we consider five possibilities. The cross-sectional distributions can be plotted with
par(mfrow=c(2,3))
seqdplot(seqt, with.missing=F, withlegend=F)
seqdplot(seqt.na, with.missing=F, withlegend=F)
seqdplot(seqt, with.missing=T, withlegend=F)
seqdplot(seqt.na, with.missing=T, withlegend=F)
seqdplot(seqt.mm, with.missing=T, withlegend=F)
seqlegend(seqt.mm)
and the transversal entropies for each situation are obtained as
te <- seqstatd(seqt, with.missing=F)$Entropy
te.na <- seqstatd(seqt.na, with.missing=F)$Entropy
te.T <- seqstatd(seqt, with.missing=T)$Entropy
te.na.T <- seqstatd(seqt.na, with.missing=T)$Entropy
te.mm <- seqstatd(seqt.mm)$Entropy
te.tab <- data.frame(te, te.na, te.T, te.na.T, te.mm)
round(te.tab, 3)
## te te.na te.T te.na.T te.mm
## [1] 0.000 0.000 0.000 0.000 0.000
## [2] 0.631 0.631 0.500 0.500 0.500
## [3] 0.000 0.000 0.000 0.406 0.406
## [4] 0.579 0.579 0.459 0.750 0.750
## [5] 0.631 0.631 0.500 0.750 0.750
## [6] 0.631 0.631 0.500 0.750 0.750
## [7] 0.000 0.000 0.000 0.406 0.406
## [8] 0.000 0.000 0.000 0.406 0.406
We observe that with with.missing=FALSE the computed entropy is the same whatever the value of the right argument (first 2 columns). For with.missing=TRUE, the results differ. The difference between te.T and the first two columns is due to the normalizing factor, i.e., the entropy of the alphabet, which has one more token (the missing token) when we set with.missing=TRUE.
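The effect of the normalizing factor can be checked by hand. At position 2, the four sequences contain a and b twice each, so the state distribution is (0.5, 0.5, 0). The sketch below (variable names are made up) divides the same raw entropy by the log of the alphabet size, 3 without and 4 with the missing token.

```r
# State distribution at position 2: a and b each observed in 2 of 4 sequences
p <- c(a = 0.5, b = 0.5, c = 0)

# Raw (non-normalized) Shannon entropy, skipping zero proportions
h <- -sum(p[p > 0] * log(p[p > 0]))

round(h / log(3), 3)  # alphabet {a, b, c}
## [1] 0.631
round(h / log(4), 3)  # alphabet {a, b, c, missing}
## [1] 0.5
```

These are the values 0.631 and 0.500 shown in row 2 of the table for te and te.T.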
Similar results hold for longitudinal entropies returned by the seqient TraMineR function.
The longitudinal entropy depends on the distribution only, not the sequence length. E.g., the first two sequences have the same longitudinal distribution and we get:
seqient(seqt)[1:2]
## [1] 0.9463946 0.9463946
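The same value can be recovered with a few lines of base R. This is only a sketch with a made-up helper long_entropy, not TraMineR's implementation: both sequences spend one quarter of their time in a, one half in b and one quarter in c, and the entropy of that distribution is normalized by log(3).

```r
# Normalized longitudinal entropy of one sequence (hypothetical helper)
long_entropy <- function(states, alphabet) {
  p <- table(factor(states, levels = alphabet)) / length(states)
  p <- p[p > 0]                                # drop unvisited states
  -sum(p * log(p)) / log(length(alphabet))     # normalize by the maximum entropy
}

alphabet <- c("a", "b", "c")
s1 <- strsplit("a-b-b-c", "-")[[1]]
s2 <- strsplit("a-a-b-b-b-b-c-c", "-")[[1]]

round(c(long_entropy(s1, alphabet), long_entropy(s2, alphabet)), 7)
## [1] 0.9463946 0.9463946
```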
The Turbulence depends on the length of the sequence. The Turbulence is defined by Elzinga and Liefbroer (2007) as the log (in base 2) of the product between the number of subsequences of the DSS (sequence of distinct successive states) and the inverse of the normalized variance of the time spent in the states present in the sequence. This latter normalized variance is obtained by dividing the variance by the maximum possible variance, and it is this maximum that depends on the sequence length.
seqST(seqt)[1:2]
## [1] 3.00000 4.79518
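As a rough check, the two values can be reproduced in base R under the following assumptions, which are mine and not stated in the text above: the turbulence is computed as $\log_2\left(\phi \, \frac{s^2_{\max} + 1}{s^2 + 1}\right)$, where $\phi$ counts the distinct subsequences of the DSS (empty subsequence included), $s^2$ is the population variance of the spell durations, and $s^2_{\max} = (n - 1)(1 - \bar{t})^2$ with $n$ the number of spells and $\bar{t}$ their mean duration. The helpers n_subseq and turbulence are hypothetical.

```r
# Number of distinct subsequences of x, counting the empty one
n_subseq <- function(x) {
  n <- 1
  last <- list()                # count recorded before the previous occurrence of each state
  for (s in x) {
    prev <- if (is.null(last[[s]])) 0 else last[[s]]
    last[[s]] <- n
    n <- 2 * n - prev
  }
  n
}

# Turbulence under the assumptions stated above (not TraMineR's code)
turbulence <- function(states) {
  r <- rle(states)
  dur <- r$lengths                              # spell durations
  phi <- n_subseq(r$values)                     # subsequences of the DSS
  m <- mean(dur)
  s2 <- mean((dur - m)^2)                       # population variance
  s2max <- (length(dur) - 1) * (1 - m)^2        # maximum variance, depends on length
  log2(phi * (s2max + 1) / (s2 + 1))
}

s1 <- strsplit("a-b-b-c", "-")[[1]]
s2 <- strsplit("a-a-b-b-b-b-c-c", "-")[[1]]
round(c(turbulence(s1), turbulence(s2)), 5)
## [1] 3.00000 4.79518
```

Both sequences have the same DSS (a-b-c, hence 8 subsequences); only the duration variance and its length-dependent maximum differ.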
Likewise, the complexity index also depends on the sequence length. This index is defined (A. Gabadinho et al., 2011) as the geometric mean of the normalized entropy and the number of transitions (the length of the DSS minus one) normalized by its maximum, the sequence length minus one. Thus, the sequence length affects the index through this latter normalization.
seqici(seqt)[1:2]
## [1] 0.7943109 0.5199985
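A small base R sketch reproduces these two values, assuming the index is the square root of the normalized entropy times the number of transitions divided by the sequence length minus one. The helper complexity is made up for illustration and is not TraMineR's code.

```r
# Complexity index under the stated assumption (hypothetical helper)
complexity <- function(states, alphabet) {
  n_trans <- length(rle(states)$lengths) - 1    # transitions = length of DSS minus 1
  p <- table(factor(states, levels = alphabet)) / length(states)
  p <- p[p > 0]
  h_norm <- -sum(p * log(p)) / log(length(alphabet))  # normalized entropy
  sqrt(n_trans / (length(states) - 1) * h_norm)
}

alphabet <- c("a", "b", "c")
s1 <- strsplit("a-b-b-c", "-")[[1]]
s2 <- strsplit("a-a-b-b-b-b-c-c", "-")[[1]]
round(c(complexity(s1, alphabet), complexity(s2, alphabet)), 7)
## [1] 0.7943109 0.5199985
```

Both sequences have two transitions and the same entropy; the index differs only because 2/3 becomes 2/7 for the longer sequence.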
The values returned by seqient and seqici will slightly change when used with the with.missing=TRUE argument because of its effect on the entropy normalization factor.
The cluster centroid, i.e., the theoretical true center sequence which minimizes the sum of distances to all sequences in the cluster, is generally something virtual which would be defined as a mix of states at each position (just as the average of integer values can be a non-integer value).
TraMineR does not compute such virtual centers. However, it can compute the distance to the virtual center (for the formula used, see Studer, Ritschard, Gabadinho and Muller, 2011, Discrepancy analysis of state sequences, Sociological Methods and Research, Vol. 40(3), pp. 471-510). The distance to the center is returned by the disscenter function. To get the distance to the center from the sequence with the highest silhouette in each cluster, we first retrieve the indexes of those sequences.
Now, you may also consider comparing the sequence with maximum silhouette value to the medoid, i.e., the sequence in the data with the smallest sum of distances to the other sequences in the cluster.
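Before turning to the medoid, the two steps for the virtual center (retrieving the highest-silhouette sequence of each cluster, then reading off its distance to the center) could be sketched as below. The data, distance measure, clustering and object names (seqs, d, cl, sil, imax, dc) are all assumptions made for illustration; only disscenter is named in the answer.

```r
library(TraMineR)
library(cluster)

# Assumed setup: first 100 mvad sequences, LCS distances, 3 Ward clusters
data(mvad)
seqs <- seqdef(mvad[1:100, 17:86])
d <- seqdist(seqs, method = "LCS")
cl <- cutree(hclust(as.dist(d), method = "ward.D"), k = 3)

# Silhouette value of each sequence
sil <- silhouette(cl, dmatrix = d)[, "sil_width"]

# Index of the sequence with the highest silhouette in each cluster
imax <- tapply(seq_along(cl), cl, function(idx) idx[which.max(sil[idx])])

# Distance of every sequence to the virtual center of its own cluster
dc <- disscenter(d, group = cl)

# Distances to the center for the highest-silhouette sequences
dc[imax]
```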
You get a plot of the medoid of each cluster with seqrplot.
Alternatively, you can retrieve the index number of the medoids, and then print or plot the medoids as follows
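The original code is not reproduced here; what follows is only a sketch of one way to do this, assuming that disscenter with medoids.index="first" returns one medoid index per group. The data, clustering and object names other than icenter are made up for illustration.

```r
library(TraMineR)

# Assumed setup: first 100 mvad sequences, LCS distances, 3 Ward clusters
data(mvad)
seqs <- seqdef(mvad[1:100, 17:86])
d <- seqdist(seqs, method = "LCS")
cl <- cutree(hclust(as.dist(d), method = "ward.D"), k = 3)

# Index of the (first encountered) medoid of each cluster
icenter <- disscenter(d, group = cl, medoids.index = "first")

# Print and plot the medoid sequences
print(seqs[icenter, ], format = "SPS")
seqiplot(seqs[icenter, ])

# Distances of all sequences to the medoid of the first cluster
dmed1 <- seqdist(seqs, method = "LCS", refseq = as.integer(icenter[1]))
```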
You could indeed also compute the distances to the medoids by setting, for instance, refseq = icenter[1] in seqdist for the distance to the medoid of the first cluster.
Using the seqrep.grp function from the TraMineRextras package, you get representativeness quality measures of the medoids (see Gabadinho, Ritschard, Studer, Muller, 2011, "Extracting and Rendering Representative Sequences", In Fred, A., Dietz, J.L.G., Liu, K. & Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. Series: Communications in Computer and Information Science (CCIS). Volume 128, pp. 94-106. Springer-Verlag).
Hope this helps.