Solved – When and how to use weights for sequence analysis in social science

Tags: r, sequence analysis, stata, traminer, weighted-sampling

Weighting in sequence analysis

So far, I have found few papers that address weighting in sequence analysis (for example, when using the optimal matching algorithm). Sequence analysis normally involves several steps:

1. setting or calculating substitution and insertion/deletion costs,
2. computing distance matrices, and
3. running subsequent cluster analyses or discrepancy analyses[1].

At least the R package TraMineR (see Gabadinho et al. 2010 and Gabadinho et al. 2011, p. 11) and the Stata ado SEQCOMP by Laurent Lesnard make it possible to include weights at steps 1 and 3.
Furthermore, Lesnard explicitly recommends using sample weights for steps 1 and 3:

"Sample weights should only be used to calculate transition matrices, and consequently
substitution costs. Instead of counting the number of transitions, it is simply
the weighted number of transitions that should be taken into account. The
matching procedure in itself, namely, the comparison of pair of sequences, does
not require any weights; it is by definition a one to one procedure. However, sample
weights should be turned on to interpret results, for instance, if cluster analysis
is used, the size of the clusters obtained must be weighted."
Lesnard (2010: 415, endnote 12)
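
To make "the weighted number of transitions" concrete, here is a minimal base R sketch on hypothetical toy data (the sequences in seqs and the weights in w are made up). The cost used, 2 - p(i->j) - p(j->i), is the constant transition-rate cost; Lesnard's paper derives time-dependent costs, so this only illustrates where the weights enter and is not his method.

```r
# Hypothetical toy data: 4 sequences observed at 3 time points
seqs <- matrix(c("A", "A", "B",
                 "A", "B", "B",
                 "B", "B", "B",
                 "A", "A", "A"), ncol = 3, byrow = TRUE)
w <- c(1, 2, 1, 4)          # hypothetical sampling weights, one per sequence
states <- c("A", "B")

# Pair each state with its successor; as.vector() stacks the columns,
# so the weight vector is repeated once per transition position
from <- factor(as.vector(seqs[, -ncol(seqs)]), levels = states)
to   <- factor(as.vector(seqs[, -1]), levels = states)
wt   <- rep(w, times = ncol(seqs) - 1)

# Weighted transition counts: each transition counts w instead of 1
counts <- tapply(wt, list(from = from, to = to), sum, default = 0)
trate  <- counts / rowSums(counts)      # row-normalized transition rates

# Transition-rate-based substitution costs
subcost <- 2 - trate - t(trate)
diag(subcost) <- 0

trate
subcost
```

With these weights the A to B transition rate is 0.25, against 0.4 when every sequence counts once, so the A/B substitution cost rises from 1.6 to 1.75.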

Open questions

Nonetheless, there does not seem to be a consensus in the literature on when weights are needed or useful, or on which weights to use.

What do you think is the best rationale for applying weights in sequence analysis?
When should sequences be weighted?
Do you use cross-sectional sampling weights or longitudinal weights accounting for sampling probabilities as well as panel attrition?
How do you apply weights if you have unbalanced panel data?
The usage of weights in TraMineR is well documented; but do you have examples for the usage of weights with a Stata-ado?

References

Gabadinho, Alexis, Gilbert Ritschard, Matthias Studer and Nicolas S. Müller (2010): Mining sequence data in R with the TraMineR package: A user's guide, University of Geneva.
Gabadinho, Alexis, Gilbert Ritschard, Nicolas S. Müller and Matthias Studer (2011): Analyzing and visualizing state sequences in R with TraMineR, in: Journal of Statistical Software, Vol. 40, No. 4, pp. 1-37.
Lesnard, Laurent (2010): Setting Cost in Optimal Matching to Uncover Contemporaneous Socio-Temporal Patterns, in: Sociological Methods and Research, Vol. 38, No. 3, pp. 389-419.
Studer, Matthias, Gilbert Ritschard, Alexis Gabadinho and Nicolas S. Müller (2011): Discrepancy Analysis of State Sequences, in: Sociological Methods and Research, Vol. 40, No. 3, pp. 471-510.

[1] See Studer et al. (2011) for a presentation of discrepancy analysis, an ANOVA-like approach for distance matrices.

Best Answer

I assume that you are using sampling weights to correct for representativity bias. Please note that some "data providers" require you to use the weights in your publications.

In my opinion, you should always use weights for descriptive analysis in order to get unbiased results. I think that there is more consensus for this kind of analysis. Descriptive analysis includes, for instance, cluster analysis, sequence visualization, and the computation of transition rates (and hence of substitution costs based on them). For weighted cluster analysis, you can have a look at the WeightedCluster library and its manual.
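
As a rough sketch of how this could look in TraMineR, the code below uses the mvad example data shipped with the package and its weight column. The choice of OM with indel = 1, Ward clustering, and four clusters are assumptions made only for illustration; passing the weights through the members argument of hclust follows the approach described in the WeightedCluster manual.

```r
library(TraMineR)

data(mvad)
# Attach the sampling weights to the sequence object
# (columns 17:86 of mvad hold the monthly states)
mvad.seq <- seqdef(mvad, 17:86, weights = mvad$weight)

# Step 1: substitution costs from weighted transition rates
submat <- seqsubm(mvad.seq, method = "TRATE", weighted = TRUE)

# Step 2: pairwise distances; the matching itself uses no weights
mvad.dist <- seqdist(mvad.seq, method = "OM", indel = 1, sm = submat)

# Step 3: weighted hierarchical clustering and weighted cluster sizes
wardw  <- hclust(as.dist(mvad.dist), method = "ward.D",
                 members = mvad$weight)
clust4 <- cutree(wardw, k = 4)
wsize  <- xtabs(mvad$weight ~ clust4)
round(prop.table(wsize), 3)
```

Comparing prop.table(wsize) with prop.table(table(clust4)) shows how much the weights change the estimated share of each cluster.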

Regarding the weights to use, I would recommend using longitudinal weights, since the sequences are defined for the whole period, but it depends on the exact weight definition. For a more general answer, you need to answer the following questions:

What sample do I have (at what time, and so on)?
To which population do I want to generalize?

In some panels, longitudinal weights take the sample defined by wave t and generalize it to the population at wave one. This is what you want if your aim is to follow how the wave-one population evolves.

Related Solutions

Solved – How to use weights for imbalanced data in R’s randomForest

Ok, so I found part of my answer but not the good part. It turns out the randomForest package can do stratified sampling but only for classification. Here is a link to the package author's explanation.

I'm still looking for ideas on how to do stratified sampling for regression rf's.

Solved – How to estimate the centroid of clustered sequences

The cluster centroid, i.e., the theoretical true center sequence that minimizes the sum of distances to all sequences in the cluster, is generally something virtual that would be defined as a mix of states at each position (just as the average of integer values can be a non-integer value).

TraMineR does not compute such virtual centers. However, it can compute the distance to the virtual center (for the formula used, see Studer, Ritschard, Gabadinho and Muller, 2011, Discrepancy analysis of state sequences, Sociological Methods and Research, Vol. 40(3), pp. 471-510).
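
To see why a distance to a center can be obtained without ever building the center, here is a small base R sketch on hypothetical points in the plane. It relies on an identity that is exact for squared Euclidean distances; the toy data, the weights, and the explicit check are assumptions for illustration, and this is not TraMineR's own code.

```r
# Hypothetical toy data: 4 points in the plane with weights
x <- rbind(c(0, 0), c(2, 0), c(0, 2), c(4, 4))
w <- c(1, 2, 1, 4)
W <- sum(w)

# Squared Euclidean dissimilarity matrix
d <- as.matrix(dist(x))^2

# Weighted sum of squares: SS = 1/(2W) * sum_i sum_j w_i w_j d_ij
SS <- sum(outer(w, w) * d) / (2 * W)

# Distance of each point to the weighted center, from pairwise
# dissimilarities only: (sum_j w_j d_xj - SS) / W
d.to.center <- (as.vector(d %*% w) - SS) / W

# Check against the explicitly computed weighted mean
g <- colSums(x * w) / W
direct <- rowSums(sweep(x, 2, g)^2)
all.equal(d.to.center, direct)
```

For sequences the explicit center g does not exist as a sequence, but the pairwise version of the computation still only needs the dissimilarity matrix.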

The distance to the center is returned by the disscenter function. To get the distance to the center from the sequence with highest silhouette in each cluster, we first retrieve the indexes of those sequences.

## clust4 (cluster membership), sil (silhouette values), mvad.dist and
## mvad.seq are assumed to exist from the earlier steps of the analysis.
## Looking for the index of the first sequence with max
## silhouette in each cluster
fclust <- factor(clust4)
levclust <- levels(fclust)
imax.sil <- rep(NA_integer_, length(levclust))
for (i in seq_along(levclust)) {
  max.sil <- max(sil[fclust == levclust[i]])
  imax.sil[i] <-
    which(sil == max.sil & fclust == levclust[i])[1]
}
## computing distance to center for those sequences
d.to.ctr <- disscenter(mvad.dist, group = fclust,
                       weights = mvad$weight)[imax.sil]
names(d.to.ctr) <- fclust[imax.sil]
d.to.ctr

Now, you may also consider comparing the sequence with maximum silhouette value to the medoid, i.e., the sequence in the data with the smallest sum of distances to the other sequences in the cluster.
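
The medoid definition can be illustrated with a short base R sketch on hypothetical toy data. The helper weighted.medoid is a made-up name, and using the weighted sum of distances as the criterion is my assumption about how weights would naturally enter; it is not taken from TraMineR's source.

```r
# Hypothetical toy data: six points on a line, two clusters
pos <- c(0, 1, 5, 20, 22, 30)
cl  <- c(1, 1, 1, 2, 2, 2)
w   <- c(1, 1, 4, 1, 1, 1)      # hypothetical weights
d   <- as.matrix(dist(pos))

# Index of the case with the smallest weighted sum of distances
# to the members of its cluster (hypothetical helper)
weighted.medoid <- function(d, w, members) {
  wsum <- as.vector(d[members, members, drop = FALSE] %*% w[members])
  members[which.min(wsum)]
}

medoids <- sapply(split(seq_along(cl), cl),
                  function(m) weighted.medoid(d, w, m))
medoids
```

With these weights the medoid of the first cluster is case 3; with unit weights it would be case 2, which shows that weights can change which observed case is the most central one.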

You get a plot of the medoid of each cluster with seqrplot:

## criterion = "dist" selects representatives by centrality (see ?seqrep)
seqrplot(mvad.seq, group = fclust, dist.matrix = mvad.dist,
         criterion = "dist", nrep = 1)

Alternatively, you can retrieve the index number of the medoids, and then print or plot the medoids as follows:

icenter <- disscenter(mvad.dist, group = clust4, 
            medoids.index="first", weights = mvad$weight)
print(mvad.seq[icenter,], format="SPS")
seqiplot(mvad.seq[icenter,])

You could also compute the distances to the medoids by setting, for instance, refseq = icenter[1] in seqdist to get the distances to the medoid of the first cluster.

Using the seqrep.grp function from the TraMineRextras package, you get representativeness quality measures of the medoids (see Gabadinho, Ritschard, Studer, Muller, 2011, "Extracting and Rendering Representative Sequences", In Fred, A., Dietz, J.L.G., Liu, K. & Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. Series: Communications in Computer and Information Science (CCIS). Volume 128, pp. 94-106. Springer-Verlag):

library(TraMineRextras)
## criterion = "dist" selects representatives by centrality (see ?seqrep)
seqrep.grp(mvad.seq, group = fclust, mdis = mvad.dist,
           criterion = "dist", nrep = 1, ret = "both")

Hope this helps.