Solved – Cosine Similarity Intuition

Tags: cosine distance, cosine similarity, natural language, text mining, tf-idf

I understand what cosine similarity is and how to calculate it, specifically in the context of text mining (i.e. comparing tf-idf document vectors to find similar documents). What I'm looking for is some better intuition for interpreting the results/similarity scores I come up with.

My question: If I have a cosine similarity of less than 0.707 (i.e. an angle greater than 45 degrees), is it fair to say that those respective documents/vectors are more "different" (less "similar"), since the angle between them is closer to orthogonal? My initial thought was 'yes,' but in practice so far that doesn't seem like the right way to read the numbers.
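For concreteness, here is a minimal sketch of the setup the question describes — tf-idf vectors compared with cosine similarity — assuming scikit-learn is available (the example documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",            # doc 0
    "the cat lay on the mat",            # doc 1: shares most terms with doc 0
    "stock markets fell sharply today",  # doc 2: shares no terms with doc 0
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse (3, vocab_size) matrix
sims = cosine_similarity(tfidf)                # 3x3 pairwise similarity matrix
```

Because tf-idf weights are non-negative, every entry of `sims` lands in [0, 1]: identical documents score 1, and documents with no terms in common score 0.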

Best Answer

I believe another distinction worth drawing is between cosine similarity on tf-idf vectors and cosine similarity in an embedding space, such as one created by doc2vec.

Such an embedding puts words that are used in similar contexts near each other, so you could use distance-based clustering to find similar documents. But cosine distance probably makes more sense than Euclidean distance for a couple of reasons:

  1. An embedding like doc2vec encodes information in direction and distance. Look at the examples of king - man + woman yielding queen. I'd guess that direction dominates this comparison.

  2. In high-dimensional spaces, "nearby" (distance) can begin to lose its meaning, so directional measures -- which are also bounded by definition, since cosine similarity always lies in [-1, 1] -- might make more sense if the "inner product space" supports it. (I threw the last part in there not totally understanding what an "inner product space" is, but it sounds cool and it is related... I just couldn't explain how.)
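Point 2 is easy to demonstrate: for random points, the gap between the nearest and farthest pairwise distance shrinks relative to the typical distance as dimensionality grows, which is why "nearby" becomes less informative. A small sketch of that effect (the point counts and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_spread(dim, n_points=100):
    """(max - min) pairwise Euclidean distance, relative to the mean distance."""
    pts = rng.random((n_points, dim))
    # all pairwise distances via broadcasting
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    d = d[np.triu_indices(n_points, k=1)]  # keep each pair once
    return (d.max() - d.min()) / d.mean()

low = relative_spread(2)      # in 2-D, distances vary a lot
high = relative_spread(2000)  # in 2000-D, distances concentrate
```

With these settings, `high` comes out much smaller than `low`: in the high-dimensional case, almost every point is roughly the same distance from every other point.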

So, given that, I'd say that the idea of "orthogonality" isn't a meaningful threshold here. Two documents are either together in a smaller wedge of the space or a larger wedge of the space, and that's that: 100 degrees apart is farther apart than 90 degrees, and 80 degrees apart is closer than 90. Nothing special happens at 45 degrees (or 90).
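To make the angle/similarity correspondence from the question concrete, here is a small helper (the name `angle_deg` is just for illustration) showing that 0.707 is simply cos(45°), and that higher similarity always means a smaller angle — a monotone relationship with no special breakpoint:

```python
import math

def angle_deg(cos_sim):
    """Convert a cosine similarity into the angle between the vectors, in degrees."""
    return math.degrees(math.acos(cos_sim))

# the 0.707 threshold from the question is just cos(45 degrees)
forty_five = angle_deg(math.sqrt(2) / 2)

# higher similarity -> smaller angle, smoothly, with no jump at any value
closer = angle_deg(0.9)
farther = angle_deg(0.5)
```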
