Solved – K-means on cosine similarities vs. Euclidean distance (LSA)

cosine-distance, cosine-similarity, k-means, latent-semantic-analysis, svd

I am using latent semantic analysis to represent a corpus of documents in lower dimensional space. I want to cluster these documents into two groups using k-means.

Several years ago, I did this using Python's gensim and writing my own k-means algorithm. I determined the cluster centroids using Euclidean distance, but then clustered each document based on cosine similarity to the centroid. It seemed to work pretty well.
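To make the setup concrete, here is a rough sketch of what I mean (this is not my actual code, and it assumes the documents are already LSA vectors sitting in a NumPy array): centroids are computed as ordinary Euclidean means, but each document is assigned to the centroid it is most cosine-similar to.

```python
import numpy as np

def hybrid_kmeans(X, k=2, n_iter=100, seed=0):
    """Sketch: Euclidean-mean centroids, cosine-similarity assignment."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # assign each document to the centroid with the highest cosine similarity
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        new_labels = (Xn @ Cn.T).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # assignments stopped changing
        labels = new_labels
        # recompute centroids as plain (Euclidean) means
        for j in range(k):
            members = X[labels == j]
            if len(members):           # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# toy usage with random stand-ins for LSA document vectors
X = np.random.default_rng(1).normal(size=(200, 50))
labels, _ = hybrid_kmeans(X, k=2)
```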

Now I am trying to do this on a much larger corpus of documents. K-means is not converging, and I'm wondering if it's a bug in my code. I read recently that you shouldn't cluster using cosine similarity, because k-means only works with Euclidean distance, even though, as I mentioned, it appeared to work fine in my smaller test case.

Now I've come across this on the LSA Wikipedia page:

Documents and term vector representations can be clustered using traditional clustering algorithms like k-means using similarity measures like cosine.

So which is it? Can I use cosine similarity or not?

Best Answer

Yes, you can use it. The catch is that cosine similarity is not a distance, which is why it is called a similarity. Nevertheless, it can be converted to a distance, as explained here.
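One common conversion, for instance, is cosine distance = 1 − cosine similarity (strictly a dissimilarity rather than a metric). And if you L2-normalize the vectors first, squared Euclidean distance becomes exactly twice that quantity, so ordinary Euclidean k-means on normalized vectors is effectively clustering by cosine. A small NumPy illustration of the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dist = 1.0 - cos_sim              # common "cosine distance" conversion

# For unit-length vectors, squared Euclidean distance is a monotone
# function of cosine distance: ||a - b||^2 = 2 * (1 - cos(a, b)).
a_u, b_u = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(np.sum((a_u - b_u) ** 2), 2 * cos_dist))    # True
```

In practice, this means you can L2-normalize the document vectors and run standard Euclidean k-means, which optimizes something very close to the cosine criterion.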

In fact, you can use just about any distance. A very nice study of the properties of distance functions in high-dimensional spaces (as is usually the case in information retrieval) is On the Surprising Behavior of Distance Metrics in High Dimensional Space. It does not compare Euclidean vs. cosine, though.

I also came across this study, where they claim that in high-dimensional spaces both distances tend to behave similarly.
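If you want a quick sanity check of that claim, here is a small sketch (using synthetic Gaussian data rather than real LSA vectors, so take it as an illustration only) that compares how the two distances rank document pairs in a high-dimensional space:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Synthetic check: do Euclidean and cosine distances rank pairs similarly
# when the dimensionality is high?
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1000))      # 300 "documents", 1000 dimensions

rho, _ = spearmanr(pdist(X, "euclidean"), pdist(X, "cosine"))
print(f"rank correlation between the two distances: {rho:.3f}")
```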