Solved – A question on cosine similarity & k-means

Tags: clustering, cosine distance, cosine similarity, distance, k-means

I used the following code to perform clustering of a dataset in R.

library(proxy)  # base stats::dist has no "cosine" method; proxy::dist does
distMatrix1 <- dist(sample2, method = "cosine")
km <- kmeans(distMatrix1, 3)

I have some questions:

  1. When the distance matrix is created it is an N*N matrix, so is the average of each row fed to the kmeans function in R?
  2. How are the cluster centroids calculated in this case? Does the clustering use Euclidean distances, or does it use the cosine dot-product formula?
  3. What is the significance of the clusters obtained? Do the entities which lie in the same cluster behave similarly?

Best Answer

What you're doing here is clustering (with k-means) the feature vectors made of each point's similarity to the other points of your dataset. More precisely, your data sample2 is of size, say, $N \times P$, where $P$ is the dimension of your data and $N$ the number of samples. You compute the cosine similarity of each point to the others, meaning you create a matrix $M$ of size $N \times N$ with entries $m_{i,j} = \langle x_i, x_j \rangle = \sum_{p = 1}^P x_{i,p}\, x_{j,p}$ (assuming the rows have been normalized beforehand, for the sake of simplicity). When you feed k-means this matrix, you cluster your data with a different representation: the feature vector of each point is no longer its row of sample2 but its row of the similarity matrix, i.e. the interaction of that point with all the others.
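The pipeline described above (original R code) can be sketched in Python with NumPy and scikit-learn; the toy sample2 below is an assumption for illustration, with two clearly separated directions so the clusters are easy to see:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for sample2: N = 6 points in P = 4 dimensions,
# drawn around two different directions so cluster structure exists.
sample2 = np.vstack([
    rng.normal(loc=[5, 0, 0, 0], scale=0.1, size=(3, 4)),
    rng.normal(loc=[0, 0, 5, 0], scale=0.1, size=(3, 4)),
])

# Row-normalize so the dot product of two rows is their cosine similarity.
X = sample2 / np.linalg.norm(sample2, axis=1, keepdims=True)

# M[i, j] = <x_i, x_j>: the N x N cosine-similarity matrix.
M = X @ X.T

# Cluster the ROWS of M: each point is now represented by its
# similarities to all N points, not by its original P features.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(M)
print(km.labels_)
```

The first three points end up in one cluster and the last three in the other, since each group's rows of M look alike (similarities near 1 within the group, near 0 across).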

The clusters will hence be points of an $N$-dimensional space. But this doesn't change the fact that the metric used inside k-means is Euclidean: you find the clusters (for the Euclidean distance) in this space of dimension $N$ made of the cosine similarities of each point with the others. The clusters have the same interpretation as any k-means clusters, except that you changed the input representation.
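One reason Euclidean k-means still relates to cosine similarity: for unit-norm vectors there is an exact identity, $\|x - y\|^2 = 2(1 - \cos(x, y))$, so on normalized data, Euclidean closeness and cosine similarity induce the same ordering. A quick numerical check (the random vectors are just an illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), rng.normal(size=4)

# Normalize to unit length so <x, y> is exactly the cosine similarity.
x /= np.linalg.norm(x)
y /= np.linalg.norm(y)

cos_sim = x @ y
sq_euclid = np.sum((x - y) ** 2)

# For unit vectors: ||x - y||^2 = 2 * (1 - cos(x, y)).
print(np.isclose(sq_euclid, 2 * (1 - cos_sim)))  # True
```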

What you did is the first step of a family of well-known methods called spectral clustering. You'll find plenty of information about this online. Note that there are interesting results relating the clustering of similarity-feature vectors to a min-cut algorithm over the similarity graph. I advise you to read this paper to begin with; then this one is a fundamental paper.
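To make the connection concrete, here is a minimal sketch of (unnormalized) spectral clustering in Python: build the graph Laplacian from a similarity matrix, embed the points with its lowest eigenvectors, then run k-means in that embedding. The hand-built block-structured similarity matrix is an assumption for demonstration:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, k):
    """Minimal unnormalized spectral clustering on a similarity matrix S."""
    D = np.diag(S.sum(axis=1))   # degree matrix
    L = D - S                    # unnormalized graph Laplacian
    # Eigenvectors for the k smallest eigenvalues carry the cluster structure
    # (np.linalg.eigh returns eigenvalues in ascending order).
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]              # N x k spectral embedding
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Two obvious blocks: high similarity within each group, near zero across.
S = np.array([
    [1.0, 0.9, 0.8, 0.0, 0.1, 0.0],
    [0.9, 1.0, 0.9, 0.1, 0.0, 0.0],
    [0.8, 0.9, 1.0, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.0, 1.0, 0.9, 0.8],
    [0.1, 0.0, 0.0, 0.9, 1.0, 0.9],
    [0.0, 0.0, 0.1, 0.8, 0.9, 1.0],
])
labels = spectral_clustering(S, 2)
print(labels)
```

The two blocks of the similarity matrix are recovered as the two clusters, which is the behavior the min-cut interpretation predicts for such a graph.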