Solved – Understanding the use of logarithms in the TF-IDF formula

clustering, machine learning, mathematical-statistics, natural language, text mining

I was reading:

https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition

But I cannot seem to understand exactly why the formula was constructed the way it is.

What I do understand:

iDF should at some level measure how frequently a term S appears across the documents, decreasing in value as the term appears in more documents.

From that perspective

$$ iDF(S) = \frac{\# \text{ of Documents}}{\# \text{ of Documents containing S}}$$

Furthermore term frequency can be rightly described as

$$ tf(S,D) = \frac{\# \ \text{of Occurrences of S in document D}}{\# \ \text{maximum number of occurrences for any string Q in document D}} $$

So then the measure

$$ iDF(S) \times tf(S,D) $$

is in some way proportional to how frequently a term appears in a given document, and how unique that term is over the set of documents.
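The two ratios above can be sketched directly in code. This is a minimal illustration of the question's (un-logged) definitions; the toy corpus and document names are assumptions, not part of the original.

```python
from collections import Counter

# Toy corpus (illustrative assumption): each document is a list of tokens.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat".split(),
    "d3": "dogs and cats quarrel".split(),
}

def idf(term):
    """# of documents / # of documents containing the term (no log yet)."""
    n_containing = sum(term in words for words in docs.values())
    return len(docs) / n_containing

def tf(term, doc):
    """Occurrences of term in doc / max occurrences of any term in doc."""
    counts = Counter(docs[doc])
    return counts[term] / max(counts.values())

# "cat" occurs once in d1 (max count there is "the" with 2) and in 1 of 3 docs:
score = tf("cat", "d1") * idf("cat")  # → 0.5 * 3.0 = 1.5
```

A very common word such as "the" gets a high tf but a low idf, so the product stays small, matching the intuition described above.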

What I don't understand

But the formula given describes it as

$$ \left( \log(iDF(S)) \right) \left( \frac{1}{2} + \log(\frac{1}{2} tf(S,D)) \right) $$

I wish to understand the need for the logarithms described in the definition. Like, why are they there? What aspect do they emphasize?

Best Answer

The aspect emphasised is that the relevance of a term or a document does not increase proportionally with term (or document) frequency. Using a sub-linear function therefore helps dampen this effect. To that extent, the influence of very large or very small values (e.g. very rare words) is also amortised. Finally, as most people intuitively perceive scoring functions to be somewhat additive, using logarithms turns the probability of independent terms co-occurring from a product, $P(A, B) = P(A) \, P(B)$, into a sum: $\log(P(A,B)) = \log(P(A)) + \log(P(B))$.
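Both effects are easy to see numerically. The snippet below is a sketch with assumed toy numbers, using the common sublinear tf variant $1 + \log f$ rather than the exact formula from the question.

```python
import math

# 1) Sub-linear growth: a 1000x increase in raw frequency yields only a
#    modest increase in the damped score (1 + log f is one common variant).
raw = [1, 10, 100, 1000]
damped = [1 + math.log(f) for f in raw]  # grows roughly linearly in log f

# 2) Additivity: the log of a product of independent probabilities is the
#    sum of the logs, so scores of independent terms can simply be added.
p_a, p_b = 0.2, 0.5
log_joint = math.log(p_a * p_b)
log_sum = math.log(p_a) + math.log(p_b)
```

Here a term occurring 1000 times scores under 8 rather than 1000, and the two log expressions agree up to floating-point error.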

As the Wikipedia article you link notes, the justification of TF-IDF is still not well established; it is/was a heuristic that we want to make rigorous, not a rigorous concept we want to transfer to the real world. As mentioned by @Anony-Mousse, a very good read on the matter is Robertson's Understanding Inverse Document Frequency: On theoretical arguments for IDF. It gives a broad overview of the whole framework and attempts to ground the TF-IDF methodology in the relevance weighting of search terms.