I have a time-series dataset and need to find clusters of similar series in it.
Based on my current knowledge and the requirements of my application, I used the shape-based distance (SBD) to calculate the dissimilarity matrix for my dataset and applied hierarchical clustering to it (using tsclust).
The R command used is:
library(dtwclust)
hclust <- tsclust(mydata, type = "h", distance = "sbd")
I also used cvi for cluster validation (cvi(hclust)) and obtained a Silhouette width of 0.508 (which I believe is good enough). The problem is that I don't know where to cut this cluster tree: for how many clusters (value of k) or at what height (value of h) do I get the Silhouette width of 0.5?
Moreover, once I know this value of k or h, how do I find the centroids (time-series data) that represent these clusters?
Best Answer
The answer was pretty straightforward, thanks to Alexis, who suggested that I read Appendix A of the documentation, and @Haroon, who suggested emailing Alexis. Here are the answers to my questions:
1) How to identify the value of k or h for which the Silhouette value was 0.5?
I ran a loop over different numbers of clusters and cut the dendrogram into that many clusters each time. I then computed the silhouette value for each resulting set of clusters and identified the value of k for which it was 0.5. The code for this is:
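The loop described above could look roughly like the following. This is a minimal sketch, not the original code from this answer. It assumes the CharTraj example data shipped with dtwclust as a stand-in for mydata, and the range of k values is arbitrary. Note that tsclust() uses k = 2 when k is not given, so a silhouette value obtained from a call without k refers to a two-cluster cut.

```r
library(dtwclust)

# Assumption: example data from dtwclust stands in for `mydata`.
data("uciCT")
subset_ct <- CharTraj[1:20]
series <- reinterpolate(subset_ct, new.length = max(lengths(subset_ct)))

# Assumed range of cluster counts to try; adjust to your data.
k_values <- 2:8

# For each k, run hierarchical clustering with SBD and compute the
# silhouette index of the resulting partition.
sil <- vapply(k_values, function(k) {
  hc <- tsclust(series, type = "h", k = k, distance = "sbd")
  unname(cvi(hc, type = "Sil"))
}, numeric(1))
names(sil) <- k_values

print(round(sil, 3))

# One common heuristic (an assumption, not part of the original answer):
# keep the k with the largest silhouette value.
best_k <- k_values[which.max(sil)]
print(best_k)
```

Reading the printed values against the silhouette reported earlier shows which k produced it. Calling tsclust() once per k recomputes the distance matrix each time, which is simple but may be slow on large datasets.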
2) How to identify cluster centroids?
Once we have identified the number of clusters, we need to find the centroids for these clusters. Please note that the object returned by tsclust() is an S4 object (refer to the documentation of TSClusters-class), so to access its formal elements (slots) we need the @ operator rather than the $ operator more commonly used in R. Once we know the value of k for our dendrogram, we can pass it to tsclust() again for better clustering.
Hope it helps someone!
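As a sketch of the centroid lookup from point 2, the snippet below re-runs tsclust() with a fixed k and reads the result through its slots. The data (CharTraj from dtwclust) and the value k = 4 are assumptions for illustration only. As far as I understand the defaults, the centroids for hierarchical clustering are series taken from the data (PAM-style), unless a centroid function is supplied.

```r
library(dtwclust)

# Assumption: example data from dtwclust stands in for `mydata`.
data("uciCT")
subset_ct <- CharTraj[1:20]
series <- reinterpolate(subset_ct, new.length = max(lengths(subset_ct)))

# Assumed number of clusters; use the k found for your own dendrogram.
k <- 4L
hc <- tsclust(series, type = "h", k = k, distance = "sbd")

# The result is an S4 object, so slots are read with @ rather than $.
print(methods::slotNames(hc))   # lists the available slots

membership <- hc@cluster        # cluster index assigned to each series
centroids  <- hc@centroids      # list with one representative series per cluster

print(table(membership))
str(centroids, max.level = 1)
```

Each element of centroids is itself a time series, so it can be plotted or compared with the members of its cluster directly.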