Do low silhouette widths mean the data has little underlying structure?

clustering, traminer

I am new to sequence analysis, and I was wondering how you would react if the average silhouette widths (ASW) from cluster analyses of Optimal Matching-based dissimilarity matrices are low (around 0.25). Would it be appropriate to conclude that there is little underlying structure that would allow the sequences to be clustered? Might you ignore the low ASW on the strength of other measures of cluster quality (I have pasted some below)? Or is it likely that choices made during the sequence analysis or the subsequent cluster analyses are responsible for the low ASW numbers?

Any suggestions would be appreciated. Thanks.

In case more context is needed:

I am examining 624 sequences of work hour mismatches (i.e., mismatches between the number of hours a person prefers to work in a week and the number of hours they actually work) among people in their 20s. All the sequences I am examining have a length of 10. My sequence object has five states (M=wants more hours, S=wants the same hours, F=wants fewer hours, O=out of the labor force, and U=unemployed).

I have not done a systematic accounting of how ASW results vary with different combinations of approaches. Still, I have tried low and medium indel costs (0.1 and 0.6 of the maximum substitution cost, because I care more about the order of events than their timing) and different clustering procedures (Ward, average linkage, and PAM). My overall impression is that the ASW numbers remain low.

Perhaps low ASW results make sense. I would expect these states to come in a variety of different orders, and the states can be repeated. Removing duplicate observations only lowers the N from 624 to 536. Studying the data reveals that there is indeed a good bit of variety, including sequences that I would consider very different: for example, people who wanted the same hours the entire time, people who developed a mismatch, people who resolved a mismatch, and people who oscillated back and forth between having and not having a mismatch. Perhaps lack of clearly differentiated clusters is not the same thing as a lack of interesting variation. Still, the weak cluster results seem to leave me without a nice way to summarize the sequences.

Results from Ward's method with indel set at 0.1 of the substitution cost of 2:
These statistics seem to suggest that a 6-cluster solution might be good.
The ASW, however, is low, at least for solutions that have a reasonable number of clusters (2 or 3 is too few).

           PBC   HG HGSD  ASW ASWw     CH   R2   CHsq R2sq   HC
cluster2  0.56 0.78 0.75 0.38 0.38 110.76 0.15 241.65 0.28 0.14
cluster3  0.51 0.68 0.65 0.27 0.27 108.10 0.26 237.60 0.43 0.17
cluster4  0.54 0.74 0.71 0.25 0.25  88.66 0.30 203.72 0.50 0.14
cluster5  0.59 0.83 0.79 0.25 0.25  75.85 0.33 183.21 0.54 0.09
cluster6  0.59 0.85 0.82 0.24 0.25  66.94 0.35 164.51 0.57 0.08
cluster7  0.47 0.79 0.75 0.18 0.19  64.09 0.38 154.47 0.60 0.12
cluster8  0.47 0.81 0.77 0.20 0.21  59.47 0.40 152.36 0.63 0.11
cluster9  0.48 0.84 0.80 0.19 0.21  56.68 0.42 147.83 0.66 0.10
cluster10 0.47 0.86 0.82 0.19 0.21  53.24 0.44 140.18 0.67 0.08

Best Answer

The ASW is a measure of the coherence of a clustering solution. A high ASW value means that the clusters are homogeneous (each observation is close to the other members of its cluster) and that they are well separated. According to Kaufman and Rousseeuw (1990), a value below 0.25 means that the data are not structured. Between 0.25 and 0.5, the data might be structured, but the structure might also be an artifact. Please keep in mind that these values are indicative and should not be used as a decision threshold. They are not theoretically defined (they are not based on some p-value) but rest on the experience of the authors. Hence, according to these low ASW values, your data seem to be quite unstructured. If the purpose of the cluster analysis is only descriptive, then you can argue that it reveals some (but only some) of the most salient patterns. However, I think that in your case, you should not draw any theoretical conclusions from your clustering.
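For reference, the quantity behind these numbers can be written out as follows. This is the standard silhouette definition; the notation is chosen here for illustration and is not taken from the question: d(i, j) is the dissimilarity between sequences i and j (for example, the OM distance), C(i) is the cluster containing sequence i, and n is the number of sequences.

$$
a(i) = \frac{1}{|C(i)| - 1} \sum_{j \in C(i),\, j \neq i} d(i, j),
\qquad
b(i) = \min_{C \neq C(i)} \frac{1}{|C|} \sum_{j \in C} d(i, j)
$$

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
\mathrm{ASW} = \frac{1}{n} \sum_{i=1}^{n} s(i)
$$

By convention s(i) = 0 when sequence i is alone in its cluster. A value of s(i) near 1 means the sequence is much closer to its own cluster than to the nearest other cluster, a value near 0 means it sits between two clusters, and a negative value means it is on average closer to another cluster than to its own.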

You can also look at the "per cluster" ASW values (these are given by the function wcClusterQuality in the WeightedCluster package). Maybe some of your clusters are well defined while others are "spurious" (ASW<0), resulting in a low overall ASW value.
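To make the per-cluster idea concrete, here is a minimal sketch in plain Python (not R, and not the WeightedCluster implementation) that computes unweighted silhouette widths from a dissimilarity matrix and averages them within each cluster. The function names, the toy one-dimensional points, and the cluster labels are all made up for illustration; with real data the matrix would be your OM dissimilarities.

```python
from collections import defaultdict


def silhouette_widths(diss, labels):
    """Return the silhouette width s(i) of every observation.

    diss   : square, symmetric list of lists of dissimilarities
    labels : cluster label of each observation (at least two clusters)
    """
    members = defaultdict(list)
    for idx, lab in enumerate(labels):
        members[lab].append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = members[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # a: mean dissimilarity to the other members of the own cluster
        a = sum(diss[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster
        b = min(
            sum(diss[i][j] for j in other) / len(other)
            for other_lab, other in members.items()
            if other_lab != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def per_cluster_asw(widths, labels):
    """Average the silhouette widths separately within each cluster."""
    grouped = defaultdict(list)
    for w, lab in zip(widths, labels):
        grouped[lab].append(w)
    return {lab: sum(ws) / len(ws) for lab, ws in grouped.items()}


# Toy data (hypothetical): two tight groups plus a cluster "C" that
# straddles them, standing in for a spurious cluster.
points = [0, 1, 2, 10, 11, 12, 4, 8]
labels = ["A", "A", "A", "B", "B", "B", "C", "C"]
diss = [[abs(p - q) for q in points] for p in points]

widths = silhouette_widths(diss, labels)
per_cluster = per_cluster_asw(widths, labels)
overall = sum(widths) / len(widths)

print(per_cluster)
print(overall)
```

In this toy example clusters A and B have clearly positive ASW values while C is negative, so the overall average understates how well defined A and B are. That is the pattern the per-cluster values are meant to reveal.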

You can also try bootstrap strategies, which should give you a better hint. In R, the function clusterboot in the package fpc can be used for this purpose (look at the help page). However, it does not work with weighted data. If your data are unweighted, I think it is worth a try.
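The general idea behind such a bootstrap check can be sketched as follows, again in plain Python. This is a simplified illustration under stated assumptions and not the fpc implementation: each resample keeps only the distinct observations drawn, the clusterer `gap_clusters` is a hypothetical stand-in (it splits one-dimensional points at their largest gaps) for whatever procedure you actually use, and stability is summarized as the mean Jaccard similarity between each original cluster and its best match in the resampled clustering.

```python
import random


def jaccard(a, b):
    """Jaccard similarity of two collections of observation indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gap_clusters(indices, points, k):
    """Hypothetical stand-in clusterer: split sorted 1-D points at the
    k - 1 largest gaps. Replace with your own clustering procedure."""
    order = sorted(indices, key=lambda i: points[i])
    gaps = sorted(
        ((points[order[t + 1]] - points[order[t]], t)
         for t in range(len(order) - 1)),
        reverse=True,
    )
    cuts = sorted(t for _, t in gaps[: k - 1])
    clusters, start = [], 0
    for t in cuts:
        clusters.append(order[start: t + 1])
        start = t + 1
    clusters.append(order[start:])
    return clusters


def bootstrap_stability(n, cluster_fn, n_boot=100, seed=0):
    """Mean best-match Jaccard similarity for each original cluster.

    cluster_fn(indices) must return a list of clusters, each a list of
    indices taken from `indices`.
    """
    rng = random.Random(seed)
    original = cluster_fn(list(range(n)))
    totals = [0.0] * len(original)
    counts = [0] * len(original)
    for _ in range(n_boot):
        # Draw n indices with replacement, keep the distinct ones.
        sample = sorted(set(rng.choices(range(n), k=n)))
        in_sample = set(sample)
        resampled = cluster_fn(sample)
        for c, cluster in enumerate(original):
            kept = [i for i in cluster if i in in_sample]
            if not kept:
                continue  # cluster absent from this resample
            totals[c] += max(jaccard(kept, new) for new in resampled)
            counts[c] += 1
    return [t / m if m else float("nan") for t, m in zip(totals, counts)]


# Toy data (hypothetical): two well-separated groups of ten points.
points = list(range(0, 10)) + list(range(30, 40))
stability = bootstrap_stability(
    len(points), lambda idx: gap_clusters(idx, points, k=2)
)
print(stability)
```

Clusters that keep reappearing across resamples get values near 1, while clusters that dissolve when the sample changes get much lower values. For real analyses the clusterboot help page describes how that function handles resampling and how to read its output.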

Finally, you may want to have a closer look at your data and your categorization. Maybe your categories are too unstable or not well defined. However, that does not seem to be the case here.

As you have said, "lack of clearly differentiated clusters is not the same thing as a lack of interesting variation". There are other methods to analyse the variability of your sequences such as discrepancy analysis. These methods allow you to study the links between sequences and explanatory factors. You may, for instance, try to build sequence regression trees (function "seqtree" in package TraMineR).
