I have a dataset of sequences of varying length and want to calculate descriptive measures for it. The main cause of the differing lengths is that some cases are right censored; these are handled properly when defining the sequences through the `right="DEL"` argument in TraMineR's `seqdef`.
My questions are:
- Will the entropy of the state distribution calculated by `seqstatd` with the options `with.missing=FALSE`, `weighted=FALSE` and `norm=TRUE` be affected by the variation in length?
- Will the sequence turbulence calculated by `seqST` be affected by the variation in length?
- Will the complexity index calculated with `seqici()` with the option `with.missing=FALSE` be affected by the varying sequence lengths?
Best Answer
There are different ways of dealing with sequences of different length.
In `TraMineR`, by setting `right="DEL"` in `seqdef`, the missing positions after the last valid state are set as void elements and will be ignored by all functions, even when used with `with.missing=TRUE`.
With `right=NA`, they are considered as missing values up to the length of the longest sequence. In that case, the effect of `with.missing=TRUE` is to turn the NA state into an additional explicit state of the alphabet. Results correspond to what we would obtain by filling the sequences with a designated symbol that would be added to the alphabet.
Therefore, assuming there is no missing element before the last valid state in each sequence, cross-sectional as well as longitudinal non-normalized entropies will be the same whether `with.missing` is set as `TRUE` or `FALSE`. Normalized values will change, however, since setting `with.missing=TRUE` increases the alphabet size by one unit.
To illustrate, let us consider four sequences of varying length and the same sequences filled with `m` tokens to make them of the same length.
Now we consider five possibilities, for each of which the cross-sectional distributions can be plotted and the transversal entropies computed.
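
The original code for this comparison is not reproduced here. As a rough sketch only, the following shows how such a comparison could be set up; the toy matrix `s` and the object names `seq.DEL`, `seq.NA` and `te` are made up for illustration and cover just three of the settings.

```r
library(TraMineR)

# Hypothetical toy data: four sequences of varying length, padded with NA
s <- rbind(
  c("a", "b", "b", "c", "c", "c"),
  c("a", "a", "b", "b", NA,  NA),
  c("b", "c", "c", NA,  NA,  NA),
  c("a", "c", NA,  NA,  NA,  NA)
)

seq.DEL <- seqdef(s, right = "DEL")  # trailing NAs become void elements
seq.NA  <- seqdef(s, right = NA)     # trailing NAs are kept as missing values

# Cross-sectional (transversal) entropy at each position
te <- cbind(
  DEL.F = seqstatd(seq.DEL, with.missing = FALSE)$Entropy,
  NA.F  = seqstatd(seq.NA,  with.missing = FALSE)$Entropy,
  NA.T  = seqstatd(seq.NA,  with.missing = TRUE)$Entropy
)
round(te, 3)
```

The first two columns should coincide, while the third treats the missing state as an extra element of the alphabet.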
We observe that with `with.missing=FALSE` the computed entropy is the same whatever the value of the `right` argument (first 2 columns). For `with.missing=TRUE`, the results differ. The difference between `te.T` and the first two columns is due to the normalizing factor, i.e., the entropy of the alphabet, which has one more token (the missing token) when we set `with.missing=TRUE`.
Similar results hold for longitudinal entropies returned by the `seqient` TraMineR function.
The longitudinal entropy depends on the distribution only, not the sequence length. E.g., the first two sequences have the same longitudinal distribution and therefore the same longitudinal entropy.
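
That the longitudinal entropy depends only on the state proportions can be checked by hand. The sketch below uses plain base R rather than `seqient`; the helper `long_entropy` and the two toy sequences are assumptions made for illustration, and the optional `k` argument mimics normalizing by the log of the alphabet size.

```r
# Longitudinal (within-sequence) Shannon entropy, optionally normalized
# by log(k), where k is the assumed alphabet size.
long_entropy <- function(x, k = NULL) {
  p <- as.numeric(table(x)) / length(x)  # state proportions within the sequence
  h <- -sum(p * log(p))
  if (is.null(k)) h else h / log(k)
}

s1 <- c("a", "b", "b", "a")
s2 <- c("a", "a", "b", "b", "b", "b", "a", "a")  # twice as long, same proportions

h1 <- long_entropy(s1)
h2 <- long_entropy(s2)
all.equal(h1, h2)  # TRUE: same distribution, same entropy

# Counting one extra (missing) state in the alphabet only changes the normalizer
h.k2 <- long_entropy(s1, k = 2)
h.k3 <- long_entropy(s1, k = 3)
c(h.k2, h.k3)
```

The non-normalized value is unaffected by the length, whereas the normalized value shrinks when the alphabet is enlarged by one state.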
The turbulence depends on the length of the sequence. Turbulence is defined by Elzinga (C. Elzinga and A. Liefbroer, 2007) as the log (in base 2) of the product of the number of subsequences of the DSS (sequence of distinct successive states) and the inverse of the normalized variance of the time spent in the states present in the sequence. This latter normalized variance is obtained by dividing the variance by the maximum possible variance, and it is this maximum that depends on the sequence length.
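
For reference, the turbulence can be written out as follows. This is the form I recall from the TraMineR help page for `seqST`, which adds 1 to both variances, so treat the exact expression as something to check against the documentation. Here $\phi(x)$ is the number of distinct subsequences of the DSS, $s_t^2$ the variance of the spell durations, $n$ the number of spells, and $\ell$ the sequence length:

$$
T(x) = \log_2\!\left(\phi(x)\,\frac{s_{t,\max}^2 + 1}{s_t^2 + 1}\right),
\qquad
s_{t,\max}^2 = (n-1)\,(1-\bar{t})^2,
\qquad
\bar{t} = \frac{\ell}{n}
$$

The sequence length enters through $\bar{t}$, the mean time spent in a spell, which is why two sequences with the same DSS but different lengths generally get different turbulence values.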
Likewise, the complexity index also depends on the sequence length. This index is defined (A. Gabadinho et al., 2011) as the geometric mean of the normalized entropy and the length of the DSS normalized by the length of the sequence. Thus, the sequence length affects the index through this latter normalization.
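
Written as a formula, and again as I recall it from the `seqici` help page rather than as an authoritative statement, the index for a sequence $s$ of length $\ell(s)$ over an alphabet $A$ is

$$
C(s) = \sqrt{\frac{q(s)}{\ell(s) - 1}\cdot\frac{h(s)}{h_{\max}}},
\qquad
h_{\max} = \log |A|
$$

where $q(s)$ is the number of transitions (the length of the DSS minus one) and $h(s)$ is the longitudinal entropy. With the same DSS and the same state proportions, a longer sequence has a larger denominator $\ell(s) - 1$ and hence a lower complexity.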
The values returned by `seqient` and `seqici` will change slightly when used with the `with.missing=TRUE` argument because of its effect on the entropy normalization factor.
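
To see the three measures side by side, one could run something like the sketch below. The two sequences are hypothetical: they share the same a-b-a pattern and the same state proportions but have lengths 4 and 8. The names `s`, `sq` and `res` are illustrative only.

```r
library(TraMineR)

# Hypothetical data: same DSS (a-b-a) and same proportions, lengths 4 and 8
s <- rbind(
  c("a", "b", "b", "a", NA,  NA,  NA,  NA),
  c("a", "a", "b", "b", "b", "b", "a", "a")
)
sq <- seqdef(s, right = "DEL")  # trailing NAs become void elements

res <- data.frame(
  length     = as.numeric(seqlength(sq)),
  entropy    = as.numeric(seqient(sq)),  # longitudinal entropy
  turbulence = as.numeric(seqST(sq)),
  complexity = as.numeric(seqici(sq))
)
res
```

If the reasoning above holds, the entropy column should be identical for both rows, while turbulence and complexity should differ.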