Solved – Entropy of state distribution, complexity index and turbulence for sequences of varying length

r, sequence analysis, traminer

I have a dataset of sequences of varying length and want to calculate descriptive measures for it. The main cause of the differing lengths is that some cases are right-censored; these are handled when defining the sequences through the right="DEL" argument of TraMineR's seqdef.

My questions are:

  1. Will the entropy of the state distribution calculated by seqstatd with the options with.missing=FALSE, weighted=FALSE and norm=TRUE be affected by the variation in length?

  2. Will the sequence turbulence calculated by seqST be affected by the variation in length?

  3. Will the complexity index calculated with seqici() with the option with.missing=FALSE be affected by the varying sequence lengths?

Best Answer

There are different ways of dealing with sequences of different length.

In TraMineR by setting right="DEL" in seqdef, the missing positions after the last valid state are set as void elements and will be ignored by all functions even when used with with.missing=TRUE.

With right=NA they are considered as missing values up to the length of the longest sequence. In that case, the effect of with.missing=TRUE is to turn the NA state into an additional explicit state of the alphabet. Results correspond to what we would obtain by filling the sequences with a designated symbol that would be added to the alphabet.

Therefore, assuming there is no missing element before the last valid state in each sequence, cross-sectional as well as longitudinal non-normalized entropies will be the same whether with.missing is set as TRUE or FALSE. Normalized values will change, however, since setting with.missing=TRUE increases the alphabet size by one unit.

To illustrate, let us consider four sequences of varying length and the same sequences filled with m tokens to make them of the same length.

library(TraMineR)

x1 <- "a-b-b-c"
x2 <- "a-a-b-b-b-b-c-c"
x3 <- "a-b"
x4 <- "a-a-b-b-c-c"

y1 <- "a-b-b-c-m-m-m-m"
y2 <- "a-a-b-b-b-b-c-c"
y3 <- "a-b-m-m-m-m-m-m"
y4 <- "a-a-b-b-c-c-m-m"

seqt    <- seqdef(c(x1,x2,x3,x4), right="DEL")
seqt.na <- seqdef(c(x1,x2,x3,x4), right=NA)
seqt.mm <- seqdef(c(y1,y2,y3,y4), right="DEL")

Now we consider five possibilities. The cross-sectional distributions can be plotted with

par(mfrow=c(2,3))
seqdplot(seqt, with.missing=FALSE, withlegend=FALSE)
seqdplot(seqt.na, with.missing=FALSE, withlegend=FALSE)
seqdplot(seqt, with.missing=TRUE, withlegend=FALSE)
seqdplot(seqt.na, with.missing=TRUE, withlegend=FALSE)
seqdplot(seqt.mm, with.missing=TRUE, withlegend=FALSE)
seqlegend(seqt.mm)

and the transversal entropies for each situation are obtained as

te      <- seqstatd(seqt, with.missing=FALSE)$Entropy
te.na   <- seqstatd(seqt.na, with.missing=FALSE)$Entropy
te.T    <- seqstatd(seqt, with.missing=TRUE)$Entropy
te.na.T <- seqstatd(seqt.na, with.missing=TRUE)$Entropy
te.mm   <- seqstatd(seqt.mm)$Entropy
te.tab <- data.frame(te, te.na, te.T, te.na.T, te.mm)
round(te.tab, 3)

##       te te.na  te.T te.na.T te.mm
## [1] 0.000 0.000 0.000   0.000 0.000
## [2] 0.631 0.631 0.500   0.500 0.500
## [3] 0.000 0.000 0.000   0.406 0.406
## [4] 0.579 0.579 0.459   0.750 0.750
## [5] 0.631 0.631 0.500   0.750 0.750
## [6] 0.631 0.631 0.500   0.750 0.750
## [7] 0.000 0.000 0.000   0.406 0.406
## [8] 0.000 0.000 0.000   0.406 0.406

We observe that with with.missing=FALSE the computed entropy is the same whatever the value of the right argument (first two columns). With with.missing=TRUE, the results differ. The difference between te.T and the first two columns is due to the normalizing factor, i.e., the maximum entropy for the alphabet, which has one more token (the missing token) when we set with.missing=TRUE. With right=NA, the missing state is in addition counted in the distributions, so te.na.T matches te.mm, the result for the sequences explicitly filled with m.
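To make the role of the normalizing factor concrete, here is a small base-R sketch (no TraMineR involved) that recomputes the three distinct values of row [4] by hand. The helper name norm_entropy is hypothetical, and the sketch assumes the normalization is simply a division by the log of the alphabet size.

```r
# Normalized Shannon entropy of a vector of observed states (base R only).
# `alphabet_size` is the number of states used for the normalization.
norm_entropy <- function(states, alphabet_size) {
  p <- as.numeric(table(states)) / length(states)  # observed proportions
  -sum(p * log(p)) / log(alphabet_size)
}

# Position 4 of the example: x1 = c, x2 = b, x3 has ended, x4 = b
pos4_valid  <- c("c", "b", "b")       # ended sequence simply dropped
pos4_filled <- c("c", "b", "m", "b")  # ended sequence counted as state "m"

te_pos4   <- norm_entropy(pos4_valid, 3)   # alphabet {a, b, c}
teT_pos4  <- norm_entropy(pos4_valid, 4)   # same distribution, alphabet + missing
temm_pos4 <- norm_entropy(pos4_filled, 4)  # missing counted in the distribution
round(c(te_pos4, teT_pos4, temm_pos4), 3)
## [1] 0.579 0.459 0.750
```

The first two values share the same distribution and differ only by the factor log(3)/log(4); the third changes because the distribution itself now includes the filler state.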

Similar results hold for longitudinal entropies returned by the seqient TraMineR function.

The longitudinal entropy depends on the distribution only, not on the sequence length. E.g., the first two sequences have the same longitudinal distribution and we get:

seqient(seqt)[1:2]

## [1] 0.9463946 0.9463946
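As a cross-check, the same value can be recomputed in base R from the state proportions within each sequence. This is only a sketch: long_entropy is a hypothetical helper, and it assumes normalization by the log of the alphabet size (3 here).

```r
# Normalized within-sequence (longitudinal) entropy, computed by hand.
long_entropy <- function(seq_string, alphabet_size, sep = "-") {
  states <- strsplit(seq_string, sep, fixed = TRUE)[[1]]
  p <- as.numeric(table(states)) / length(states)  # share of time in each state
  -sum(p * log(p)) / log(alphabet_size)
}

h1 <- long_entropy("a-b-b-c", 3)          # proportions 1/4, 1/2, 1/4
h2 <- long_entropy("a-a-b-b-b-b-c-c", 3)  # same proportions, twice the length
round(c(h1, h2), 7)
## [1] 0.9463946 0.9463946
```

Doubling every spell leaves the proportions, and hence the entropy, unchanged.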

Turbulence, in contrast, depends on the length of the sequence. Elzinga and Liefbroer (2007) define it as the base-2 logarithm of the product of the number of distinct subsequences of the DSS (the sequence of distinct successive states) and the inverse of the normalized variance of the time spent in the successive states. This latter normalized variance is obtained by dividing the variance by its maximum possible value, and it is this maximum that depends on the sequence length.

seqST(seqt)[1:2]

## [1] 3.00000 4.79518
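The two values can be reproduced by hand with the base-R sketch below. It assumes the usual form of the turbulence formula, log2(phi * (s2_max + 1) / (s2 + 1)), with the population variance of the spell durations and s2_max = (number of spells - 1) * (1 - mean duration)^2. The helpers count_subseq and turbulence are hypothetical names, and the internals of seqST may differ in details.

```r
# Number of distinct subsequences (empty one included) of a state vector.
count_subseq <- function(s) {
  n <- length(s)
  dp <- numeric(n + 1)  # dp[i + 1]: count for the first i elements
  dp[1] <- 1            # the empty subsequence
  last <- integer(0)    # named vector: last position seen for each state
  for (i in seq_len(n)) {
    dp[i + 1] <- 2 * dp[i]
    if (s[i] %in% names(last)) {
      # remove subsequences already counted at the previous occurrence
      dp[i + 1] <- dp[i + 1] - dp[last[[s[i]]]]
    }
    last[s[i]] <- i
  }
  dp[n + 1]
}

turbulence <- function(seq_string, sep = "-") {
  states <- strsplit(seq_string, sep, fixed = TRUE)[[1]]
  runs <- rle(states)               # DSS in runs$values, durations in runs$lengths
  phi <- count_subseq(runs$values)
  dur <- runs$lengths
  s2 <- mean((dur - mean(dur))^2)   # population variance of the durations
  s2_max <- (length(dur) - 1) * (1 - mean(dur))^2  # depends on sequence length
  log2(phi * (s2_max + 1) / (s2 + 1))
}

turb <- c(turbulence("a-b-b-c"), turbulence("a-a-b-b-b-b-c-c"))
round(turb, 5)
## [1] 3.00000 4.79518
```

Both sequences have the same DSS (a-b-c, hence phi = 8); the difference comes entirely from the duration term, which grows with the sequence length.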

Likewise, the complexity index also depends on the sequence length. This index is defined (A. Gabadinho et al., 2011) as the geometric mean of the normalized entropy and the number of transitions in the sequence (the length of the DSS minus one) normalized by its maximum, i.e., the sequence length minus one. Thus, the sequence length affects the index through this latter normalization.

seqici(seqt)[1:2]

## [1] 0.7943109 0.5199985
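A hand computation in base R gives the same two numbers. The sketch assumes the index is the square root of the product of the normalized longitudinal entropy and the number of transitions divided by the sequence length minus one; complexity is a hypothetical helper name, and the alphabet size of 3 is passed explicitly.

```r
# Complexity index computed by hand for a single sequence string.
complexity <- function(seq_string, alphabet_size, sep = "-") {
  states <- strsplit(seq_string, sep, fixed = TRUE)[[1]]
  n_trans <- length(rle(states)$lengths) - 1       # transitions = DSS length - 1
  p <- as.numeric(table(states)) / length(states)  # share of time in each state
  h_norm <- -sum(p * log(p)) / log(alphabet_size)  # normalized entropy
  sqrt(n_trans / (length(states) - 1) * h_norm)
}

ci <- c(complexity("a-b-b-c", 3), complexity("a-a-b-b-b-b-c-c", 3))
round(ci, 7)
## [1] 0.7943109 0.5199985
```

The entropy term is identical for both sequences; only the transition term differs (2/3 against 2/7), which is where the sequence length enters.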

The values returned by seqient and seqici will change slightly when used with the with.missing=TRUE argument because of its effect on the entropy normalization factor.
