Solved – Why add one in inverse document frequency

natural language, smoothing, text mining

My textbook lists the idf as $log(1+\frac{N}{n_t})$ where

  • $N$: Number of Documents
  • $n_t$: Number of Documents containing term $t$

Wikipedia lists this formula as a smoothed version of the actual $log(\frac{N}{n_t})$. That one I understand: it ranges from $log(\frac{N}{N})=0$ to $\infty$ which seems intuitive.
But $log(1+\frac{N}{n_t})$ goes from $log(1+1)$ to $\infty$ which seems so odd…
I know a little about smoothing from language modelling but there you would add something in the numerator as well as in the denominator because you are worried about the probability mass. But just adding $1$ doesn't make sense to me. What are we trying to accomplish here?

Best Answer

As you will see pointed out wherever tf-idf is discussed, there is no universally agreed single formula for computing tf-idf or even (as in your question) idf. The purpose of the $+ 1$ is to accomplish one of two objectives: a) to avoid division by zero when a term appears in no documents (this applies when the 1 is added to $n_t$ in the denominator, and it would not happen in a strict "bag of words" approach, where every term in the vocabulary occurs in at least one document), or b) to set a lower bound so that a term is not given a zero weight just because it appears in all documents.
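To make objective (a) concrete, here is a minimal sketch. The function names are mine, and the denominator-smoothed form $log(\frac{N}{1+n_t})$ is only one assumed way of placing the 1, not something your textbook necessarily uses.

```python
import math

def idf_plain(N, n_t):
    """Unsmoothed idf, log(N / n_t). Undefined when n_t == 0."""
    return math.log(N / n_t)

def idf_denominator_smoothed(N, n_t):
    """Assumed variant, log(N / (1 + n_t)). Defined even when n_t == 0."""
    return math.log(N / (1 + n_t))

N = 1000  # hypothetical collection size

try:
    idf_plain(N, 0)
except ZeroDivisionError:
    print("plain idf is undefined for a term that appears in no documents")

# The smoothed variant stays finite: log(1000 / 1), about 6.91.
print(idf_denominator_smoothed(N, 0))
```

Note that this placement of the 1 does nothing for objective (b): a term that appears in every document gets $log(\frac{N}{N+1})$, which is slightly below zero.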

I've actually never seen the formulation $log(1+\frac{N}{n_t})$, although you mention a textbook. But the purpose would be to set a lower bound of $log(2)$ rather than zero, as you correctly interpret. I have seen $1 + log(\frac{N}{n_t})$, which sets a lower bound of 1. The most commonly used computation seems to be $log(\frac{N}{n_t})$, as in Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008), Introduction to Information Retrieval, Cambridge University Press, p. 118, or Wikipedia (based on similar sources).
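The three lower bounds mentioned above can be checked numerically by evaluating each formula for a term that appears in every document ($n_t = N$). This is just an illustrative sketch; the function names and the value of `N` are made up.

```python
import math

def idf_plain(N, n_t):
    """log(N / n_t): the most common form."""
    return math.log(N / n_t)

def idf_one_inside(N, n_t):
    """log(1 + N / n_t): the textbook form from the question."""
    return math.log(1 + N / n_t)

def idf_one_outside(N, n_t):
    """1 + log(N / n_t): the variant with the 1 outside the log."""
    return 1 + math.log(N / n_t)

N = 1000  # hypothetical collection size

# A term that appears in all N documents hits each formula's minimum.
lower_bounds = {
    f.__name__: f(N, N)
    for f in (idf_plain, idf_one_inside, idf_one_outside)
}

for name, value in lower_bounds.items():
    print(f"{name}: {value:.3f}")
# idf_plain gives 0, idf_one_inside gives log(2) (about 0.693),
# and idf_one_outside gives 1.
```

So in both smoothed forms a term that occurs everywhere still keeps a small positive weight instead of being zeroed out of the tf-idf product.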

Not directly relevant to your query, but for a fixed $N$ the upper bound is not $\infty$ but rather $k + log(\frac{N}{s + n_t})$ evaluated at the smallest possible $n_t$, where $k, s \in \{0, 1\}$ depending on your smoothing formulation. The maximum is reached for terms that appear in 0 or 1 documents: if you smooth with $s = 1$ to make idf defined for terms with zero document frequency, it occurs at $n_t = 0$; if not, it occurs for terms that appear in just one document. IDF $\rightarrow \infty$ only when $s + n_t = 1$ and $N \rightarrow \infty$.
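As a rough check of the finite upper bound, the sketch below treats $k$ as a constant added outside the log and $s$ as a constant added to $n_t$ in the denominator. That reading of $k$ and $s$, and the helper names `idf` and `max_idf`, are my assumptions for illustration.

```python
import math

def idf(N, n_t, k=0, s=0):
    """k + log(N / (s + n_t)), with k and s each either 0 or 1."""
    return k + math.log(N / (s + n_t))

def max_idf(N, k=0, s=0):
    """Largest idf attainable for a fixed collection size N."""
    # With s == 1 a term may have n_t == 0; otherwise the smallest
    # document frequency that keeps the formula defined is 1.
    smallest_n_t = 0 if s == 1 else 1
    return idf(N, smallest_n_t, k, s)

# The maximum is finite for every fixed N and grows like log(N).
for N in (10, 1_000, 1_000_000):
    print(N, max_idf(N), max_idf(N, k=1), max_idf(N, s=1))
```

Under these assumptions the maximum is $k + log(N)$ in each case, so it only diverges as the collection itself grows without bound.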