I am assessing the similarity between documents represented as vectors of tf-idf values. I know that the cosine similarity is a well-defined and commonly used measure in information retrieval.
However I am thinking how to deal with the case where 2 documents are "linearly dependent".
Let's consider three documents:
$d_1 = [1,1,3,4] $
$d_2 = [1,1,2,4] $
$d_3 = [10,10,30,40] $
At first sight, $d_1$ and $d_2$ should be more similar than $d_1$ and $d_3$.
However, using the cosine similarity:
import numpy as np
def cos_sim(d1,d2):
return np.dot(d1,d2)/ (np.sqrt(np.dot(d1,d1))* np.sqrt(np.dot(d2,d2)))
I got:
cos_sim(d1,d2) = 0.98473192783466179
cos_sim(d1,d3) = 1.0
How should we deal with this case?
Best Answer
Recent Article have shown that "ts-ss" is better than cosine or euclidean
because
Reference paper is "A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"
If you want to see a summary of the paper, please refer to the github below.
https://github.com/taki0112/Vector_Similarity
thank you