Solved – Drawbacks with Cosine Similarity

cosine similarityinformation retrieval

I am assessing the similarity between documents represented as vectors of tf-idf values. I know that the cosine similarity is a well-defined and commonly used measure in information retrieval.

However I am thinking how to deal with the case where 2 documents are "linearly dependent".

Let's consider three documents:

$d_1 = [1,1,3,4] $

$d_2 = [1,1,2,4] $

$d_3 = [10,10,30,40] $

At first sight, $d_1$ and $d_2$ should be more similar than $d_1$ and $d_3$.

However, using the cosine similarity:

import numpy as np
def cos_sim(d1,d2):
    return np.dot(d1,d2)/ (np.sqrt(np.dot(d1,d1))* np.sqrt(np.dot(d2,d2)))

I got:

cos_sim(d1,d2) = 0.98473192783466179

cos_sim(d1,d3) = 1.0

How should we deal with this case?

Best Answer

Recent Article have shown that "ts-ss" is better than cosine or euclidean

because Cosine, Euclidean drawbacks

Reference paper is "A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"

If you want to see a summary of the paper, please refer to the github below.

https://github.com/taki0112/Vector_Similarity

thank you

Related Question