Solved – Cosine Similarity Intuition

Tags: cosine distance, cosine similarity, natural language, text mining, tf-idf

I understand what cosine similarity is and how to calculate it, specifically in the context of text mining (i.e. comparing tf-idf document vectors to find similar documents). What I'm looking for is some better intuition for interpreting the results/similarity scores I come up with.

My question: If I have a cosine similarity of less than 0.707 (i.e. an angle greater than 45 degrees), is it fair to say that those respective documents/vectors are more "different" (less "similar"), since the angle between them is closer to orthogonal? My initial thought was 'yes,' but in practice so far that doesn't seem like the right way to read the numbers.
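For concreteness, here is a minimal sketch of the setup the question describes — tf-idf vectors compared with cosine similarity — assuming scikit-learn is available (the example documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",            # doc 0
    "the cat lay on the mat",            # doc 1: shares most terms with doc 0
    "stock markets fell sharply today",  # doc 2: shares no terms with doc 0
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse (3, vocab_size) matrix
sims = cosine_similarity(tfidf)                # 3x3 pairwise similarity matrix
```

Because tf-idf weights are non-negative, every entry of `sims` lands in [0, 1]: identical documents score 1, and documents with no terms in common score 0.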

Best Answer

I believe another distinction worth drawing is between cosine similarity on tf-idf vectors and cosine similarity in an embedding space, such as one created by doc2vec.

Such an embedding puts words that are used in similar contexts near each other, so you could use distance-based clustering to find similar documents. But cosine distance probably makes more sense than Euclidean distance for a couple of reasons:

  1. An embedding like doc2vec encodes information in direction and distance. Look at the examples of king - man + woman yielding queen. I'd guess that direction dominates this comparison.

  2. In high-dimensional spaces, "nearby" (distance) can begin to lose its meaning, so directional measures -- which are also bounded by definition, since cosine similarity always lies in [-1, 1] -- might make more sense if the "inner product space" supports it. (I threw the last part in there not totally understanding what an "inner product space" is, but it sounds cool and it is related... I just couldn't explain how.)
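Point 2 is easy to demonstrate: for random points, the gap between the nearest and farthest pairwise distance shrinks relative to the typical distance as dimensionality grows, which is why "nearby" becomes less informative. A small sketch of that effect (the point counts and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_spread(dim, n_points=100):
    """(max - min) pairwise Euclidean distance, relative to the mean distance."""
    pts = rng.random((n_points, dim))
    # all pairwise distances via broadcasting
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    d = d[np.triu_indices(n_points, k=1)]  # keep each pair once
    return (d.max() - d.min()) / d.mean()

low = relative_spread(2)      # in 2-D, distances vary a lot
high = relative_spread(2000)  # in 2000-D, distances concentrate
```

With these settings, `high` comes out much smaller than `low`: in the high-dimensional case, almost every point is roughly the same distance from every other point.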

So, given that, I'd say that the idea of "orthogonality" isn't a meaningful threshold here. Two documents are either together in a smaller wedge of the space or a larger wedge of the space, and that's that: 100 degrees apart is farther apart than 90 degrees, and 80 degrees apart is closer than 90. Nothing special happens at 45 degrees (or 90).
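To make the angle/similarity correspondence from the question concrete, here is a small helper (the name `angle_deg` is just for illustration) showing that 0.707 is simply cos(45°), and that higher similarity always means a smaller angle — a monotone relationship with no special breakpoint:

```python
import math

def angle_deg(cos_sim):
    """Convert a cosine similarity into the angle between the vectors, in degrees."""
    return math.degrees(math.acos(cos_sim))

# the 0.707 threshold from the question is just cos(45 degrees)
forty_five = angle_deg(math.sqrt(2) / 2)

# higher similarity -> smaller angle, smoothly, with no jump at any value
closer = angle_deg(0.9)
farther = angle_deg(0.5)
```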
