Solved – TF-IDF versus Cosine Similarity in Document Search

Tags: cosine-distance, cosine-similarity, machine-learning, ranking, similarities

I'm wondering if anyone can help me out or point out some resources to learn more about TF-IDF and document search.

I'm trying to implement a basic document search and want to better understand the differences and trade-offs of my approach.

My current approach is to parse/stem all words in a set of documents and compute a normalized TF-IDF value for each document-word pair. When I run a query, I look up each query word, sum the corresponding TF-IDF values per document, and rank the documents by that sum.
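For concreteness, here is a minimal sketch of that approach in Python (the corpus, tokenizer, and weighting details are made-up illustrations of what I described, and stemming is omitted for brevity):

```python
import math
from collections import Counter

# Toy corpus standing in for the real document set.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats make good pets",
}

def tokenize(text):
    # Real code would also stem each token.
    return text.lower().split()

# Document frequency for each term.
df = Counter()
for text in docs.values():
    df.update(set(tokenize(text)))

n_docs = len(docs)

# Normalized TF-IDF for every document-term pair.
tfidf = {}
for doc_id, text in docs.items():
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    tfidf[doc_id] = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

def rank(query):
    """Score each document by summing the TF-IDF values of the query terms."""
    terms = tokenize(query)
    scores = {
        doc_id: sum(weights.get(t, 0.0) for t in terms)
        for doc_id, weights in tfidf.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank("cat dog"))
```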

Are there any trade-offs, differences, or mistakes in this approach? How does it compare to building a vector for each document, building a vector for the search query, and taking the cosine similarity between them to find the closest matches?

Best Answer

Xeon is right that TF-IDF and cosine similarity are two different things. TF-IDF gives you a representation for a given term in a document. Cosine similarity gives you a score for two documents that share the same representation. That said, "one of the simplest ranking functions is computed by summing the tf–idf for each query term". This approach is biased towards long documents, in which more of your query terms are likely to appear (e.g., an Encyclopaedia Britannica article). There are also much more advanced approaches based on a similar idea, most notably Okapi BM25 (see the sketch below).
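As a rough illustration of the BM25 idea, not the exact formula used by any particular engine, here is a sketch with a made-up corpus and the commonly used default parameters k1 = 1.5 and b = 0.75:

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats make good pets",
}
tokenized = {d: text.lower().split() for d, text in docs.items()}
n_docs = len(docs)
avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs

# Document frequency for each term.
df = Counter()
for toks in tokenized.values():
    df.update(set(toks))

def bm25(query, k1=1.5, b=0.75):
    """Score documents with BM25: IDF-weighted term frequency,
    saturated by k1 and normalized by document length via b."""
    terms = query.lower().split()
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for t in terms:
            if t not in tf:
                continue
            # Smoothed IDF (the +1 keeps it non-negative).
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(toks) / avg_len))
            score += idf * norm
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(bm25("cat dog"))
```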

In general, you should use cosine similarity when you are comparing elements of the same nature (e.g., documents against documents) or when you need the score itself to have a meaningful value. With cosine similarity, a score of 1.0 means the two elements are exactly the same under their representation. I would recommend these resources to learn more about the topic (a short cosine-similarity sketch follows the list):

Modern Information Retrieval, by Ricardo Baeza-Yates et al.
Introduction to Information Retrieval, by Christopher Manning et al.
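As a minimal sketch of the cosine-similarity approach, here is one way to do it with scikit-learn (the corpus and query strings are made-up examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)       # one TF-IDF vector per document
query_vector = vectorizer.transform(["cat dog"])   # query mapped into the same space

# Cosine similarity between the query and every document, highest first.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(docs[idx], scores[idx])
```

Because both the documents and the query live in the same vector space, the cosine score is comparable across documents and is not inflated simply because a document is long.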