Solved – When and how to use weights for sequence analysis in social science

r, sequence analysis, stata, traminer, weighted-sampling

Weighting in sequence analysis

So far, I have found very few papers that address weighting in sequence analysis (for example, when using the optimal matching algorithm). Sequence analysis normally involves several steps:

  1. setting or calculation of substitution and insertion/deletion costs,
  2. computation of distance matrices and
  3. subsequent cluster analysis or discrepancy analysis [1].

At least the R package TraMineR (see Gabadinho et al. 2010 and Gabadinho et al. 2011, p. 11) and the Stata ado SEQCOMP by Laurent Lesnard make it possible to include weights at steps 1 and 3.
Furthermore, Lesnard explicitly recommends using sample weights for steps 1 and 3:

"Sample weights should only be used to calculate transition matrices, and consequently
substitution costs. Instead of counting the number of transitions, it is simply
the weighted number of transitions that should be taken into account. The
matching procedure in itself, namely, the comparison of pair of sequences, does
not require any weights; it is by definition a one to one procedure. However, sample
weights should be turned on to interpret results, for instance, if cluster analysis
is used, the size of the clusters obtained must be weighted."
Lesnard (2010: 415, endnote 12)
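
To make the first part of this recommendation concrete, here is a minimal Python sketch (not TraMineR or SEQCOMP code) that replaces transition counts with weighted transition counts and derives substitution costs from them. The cost formula 2 - p(a→b) - p(b→a) with time-constant rates is an assumption chosen for illustration, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def weighted_transition_rates(sequences, weights):
    """Return {(a, b): rate}, where rate is the weighted number of a->b
    transitions divided by the weighted number of transitions leaving a."""
    trans = defaultdict(float)   # weighted count of a -> b
    origin = defaultdict(float)  # weighted count of transitions starting in a
    for seq, w in zip(sequences, weights):
        for a, b in zip(seq, seq[1:]):
            trans[(a, b)] += w
            origin[a] += w
    return {pair: n / origin[pair[0]] for pair, n in trans.items()}


def substitution_costs(rates, states):
    """Assumed cost formula: 2 - p(a->b) - p(b->a), and 0 on the diagonal."""
    costs = {}
    for a in states:
        for b in states:
            if a == b:
                costs[(a, b)] = 0.0
            else:
                costs[(a, b)] = 2.0 - rates.get((a, b), 0.0) - rates.get((b, a), 0.0)
    return costs


# Toy data: E = employed, U = unemployed; the second sequence has weight 3.
sequences = ["EEU", "EUU", "EEE"]
weights = [1.0, 3.0, 1.0]

rates = weighted_transition_rates(sequences, weights)
costs = substitution_costs(rates, "EU")
unweighted_rates = weighted_transition_rates(sequences, [1.0] * len(sequences))
```

With these toy weights, the E→U transition rate is 4/7 instead of the unweighted 2/5, so the E/U substitution cost drops from 1.6 to 10/7. The pairwise comparison of sequences that follows would use these costs but no weights.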

Open questions

Nonetheless, there does not seem to be a consensus in the literature on when weights are needed or on which weights are useful.

  • What do you think is the best rationale for applying weights in sequence analysis?
  • When should sequences be weighted?
  • Do you use cross-sectional sampling weights or longitudinal weights accounting for sampling probabilities as well as panel attrition?
  • How do you apply weights if you have unbalanced panel data?
  • The use of weights in TraMineR is well documented, but do you have examples of using weights with a Stata ado?

References

  • Gabadinho, Alexis, Gilbert Ritschard, Matthias Studer and Nicolas S. Müller (2010): Mining sequence data in R with the TraMineR package: A user's guide,
    University of Geneva.
  • Gabadinho, Alexis, Gilbert Ritschard, Nicolas S. Müller and Matthias Studer (2011): Analyzing and visualizing state sequences in R with TraMineR, in: Journal of Statistical Software, Vol. 40, No. 4, pp. 1-37.
  • Lesnard, Laurent (2010): Setting Cost in Optimal Matching to Uncover Contemporaneous Socio-Temporal Patterns, in: Sociological Methods and Research, Vol. 38, No. 3, pp. 389-419.
  • Studer, Matthias, Gilbert Ritschard, Alexis Gabadinho and Nicolas S. Müller (2011): Discrepancy Analysis of State Sequences, in: Sociological Methods and Research, Vol. 40, No. 3, pp. 471-510.

[1] See Studer et al. (2011) for a presentation of discrepancy analysis, an ANOVA-like approach for distance matrices.

Best Answer

I assume that you are using sampling weights to correct for a lack of representativeness in the sample. Please note that some data providers require you to use the weights in your publications.

In my opinion, you should always use weights for descriptive analysis in order to get unbiased results, and I think there is more consensus on this kind of analysis. Descriptive analysis includes, for instance, cluster analysis, sequence visualization, and the computation of transition rates (and hence of substitution costs based on them). For weighted cluster analysis, you can have a look at the WeightedCluster library and its manual.
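
As a small illustration of why weights matter for descriptive results, here is a hedged Python sketch that compares weighted and unweighted cluster sizes. It is not WeightedCluster code; the function name, the cluster labels, and the weights are hypothetical and only show the arithmetic.

```python
from collections import defaultdict


def weighted_cluster_shares(labels, weights):
    """Return each cluster's share of the total weight."""
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    grand_total = sum(totals.values())
    return {label: t / grand_total for label, t in totals.items()}


# Hypothetical cluster membership and sampling weights for five sequences.
labels = ["A", "A", "B", "B", "B"]
weights = [2.0, 2.0, 1.0, 0.5, 0.5]

shares = weighted_cluster_shares(labels, weights)
raw_shares = weighted_cluster_shares(labels, [1.0] * len(labels))
```

In this toy case, cluster A contains 40% of the sampled sequences but represents two thirds of the weighted population, so reporting unweighted cluster sizes would understate it.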

Regarding which weights to use, I would recommend longitudinal weights, since the sequences are defined for the whole period, but it depends on the exact weight definition. For a more general answer, you need to answer the following questions:

  • What sample do I have (at what time, and so on)?
  • To which population do I want to generalize?

In some panels, longitudinal weights take the sample observed at wave t and generalize it to the population at wave one. This is what you need if you want to follow how the wave-one population evolves over time.
