Solved – R: how to find cluster centroids with tsclust

clustering, hierarchical clustering, r, time series

I have a time-series dataset and I am required to find similar clusters in the data.

Based on my current knowledge and the requirements of my application, I used the shape-based distance (SBD) to compute the dissimilarity matrix for my dataset and applied hierarchical clustering to it (using tsclust).
The R commands used are:

library(dtwclust)
hclust=tsclust(mydata,type="h",distance="sbd")

I also used cvi for cluster validation (cvi(hclust)) and got a Silhouette width of 0.508, which I believe is good enough. The problem is that I don't know where to cut the cluster tree: for how many clusters (value of k), or at what height (value of h), do I get a Silhouette width of about 0.5?

Moreover, once I know this value of k or h, how do I find the centroids (time-series data) that represent these clusters?

Best Answer

The answer was pretty straightforward, thanks to Alexis, who suggested that I read Appendix A of the documentation, and @Haroon, who suggested emailing Alexis. Here are the answers to my questions:

1) How to identify the value of k or h for which the Silhouette value was 0.5?

I ran a loop over different numbers of clusters, cutting the dendrogram into that many clusters each time. I then computed the average silhouette width for each cut and identified the value of k for which it was 0.5. The code for this is:

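library(dtwclust)
library(cluster)   # provides silhouette()
hclust=tsclust(mydata,type='h',distance='sbd')
# SBD distance matrix computed by tsclust(), taken from the distmat slot
sbddist=as.dist(hclust@distmat)

# cut the tree for each k and record the average silhouette width
# (the largest k must be smaller than the number of series)
numberofclusters=c(2:100)
silValues=numeric(length(numberofclusters))
for(size in numberofclusters){
  sbd_cluster=cutree(hclust,k=size)
  index=which(numberofclusters==size)
  x<-silhouette(sbd_cluster,dist=sbddist)
  silValues[index]=mean(x[,3])
}
plot(numberofclusters,silValues,xlab="Number of clusters",ylab="Average silhouette width")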
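The same search can probably be done without an explicit loop, since tsclust() accepts a vector for k and cvi() can return just the Silhouette index. The following is a minimal sketch under stated assumptions: it uses the CharTraj example data shipped with dtwclust in place of mydata, and the names series, k_values, clusterings, sil and best_k are made up for illustration. Note also that tsclust() defaults to k = 2, so a Silhouette value obtained from an object created without k most likely refers to the two-cluster cut.

```r
library(dtwclust)

# Example data shipped with dtwclust (assumption: stands in for mydata)
data(uciCT)
series <- CharTraj[1:40]

# Passing a vector to k returns a list with one clustering per value of k
k_values <- 2L:10L
clusterings <- tsclust(series, type = "h", k = k_values, distance = "sbd")

# Silhouette index of each clustering
sil <- sapply(clusterings, cvi, type = "Sil")

# k with the largest Silhouette index
best_k <- k_values[which.max(sil)]
print(data.frame(k = k_values, Sil = unname(sil)))
```

To find the k whose value is closest to a target such as 0.5 instead of the maximum, something like k_values[which.min(abs(sil - 0.5))] should work on the same vector.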

2) How to identify cluster centroids?

Once we have identified the number of clusters, we need to find the centroids of these clusters. Note that the object returned by tsclust() is an S4 object (see the documentation of TSClusters-class), so its slots are accessed with the @ operator rather than the $ operator more commonly used in R. Once we know the value of k for our dendrogram, we can pass it to tsclust() again so that the cluster assignments and centroids in the returned object correspond to k clusters.

hclust=tsclust(mydata,k=k,type='h',distance='sbd')
View(hclust@centroids)   #gives you a list of centroids
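For a quick look outside of View(), the slots can also be used directly. This is only a sketch, again assuming the CharTraj example data from dtwclust instead of mydata and an arbitrary k of 4; the variable names are illustrative.

```r
library(dtwclust)

# Example data shipped with dtwclust (assumption: stands in for mydata)
data(uciCT)
series <- CharTraj[1:20]

hc <- tsclust(series, type = "h", k = 4L, distance = "sbd")

centroids  <- hc@centroids   # list with one representative series per cluster
membership <- hc@cluster     # cluster label for each input series

# Series assigned to the first cluster
cluster_1 <- series[membership == 1L]

# Plot only the centroid of each cluster
plot(hc, type = "centroids")
```

As far as I understand the documentation, when no centroid function is given for hierarchical clustering, each centroid is one of the original series (chosen PAM-style), not an averaged shape.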

Hope it helps someone!
