Where did sublinear tf-idf originate?

information retrieval, references, text mining, weights

I have often come across this weighting scheme for tf-idf (term frequency – inverse document frequency) in text mining, and I would like to know where it originated so that I can cite it. I have searched extensively but cannot find a source. Specifically, this is the weighting scheme:

$$
{\rm tfidf}(t,d,D)=\big(1+\log(f_{t,d})\big)\cdot \log\!\bigg(1+\frac{N}{n_t}\bigg), \qquad N=|D|
$$

where $t$ is the query term, $d$ is the document, $D$ is the set of documents, $n_t$ is the document frequency of $t$ (the number of documents in $D$ that contain $t$), and $f_{t,d}$ is the number of times $t$ appears in $d$.
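To make the definition concrete, here is a minimal Python sketch of the weighting as written above. A few things are my own assumptions rather than part of the formula: the function name `sublinear_tfidf` is hypothetical, documents are represented as lists of tokens, the logarithm is taken to be the natural log (the formula does not state a base), and a term that is absent from the document or from the whole collection gets weight 0, since the logarithm is undefined there.

```python
import math
from collections import Counter


def sublinear_tfidf(term, doc_tokens, corpus):
    """Return (1 + log f_td) * log(1 + N / n_t) for one term and one document.

    term       -- the query term t
    doc_tokens -- the document d, as a list of tokens
    corpus     -- the collection D, as a list of token lists
    """
    f_td = Counter(doc_tokens)[term]             # raw count of t in d
    n_t = sum(1 for d in corpus if term in d)    # document frequency of t
    if f_td == 0 or n_t == 0:
        # Assumption: absent terms get weight 0 (log(0) is undefined).
        return 0.0
    N = len(corpus)                              # N = |D|
    return (1 + math.log(f_td)) * math.log(1 + N / n_t)


corpus = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
weight = sublinear_tfidf("a", corpus[0], corpus)  # f_td = 2, n_t = 1, N = 3
```

The "sublinear" part is the `1 + log f_td` factor: doubling the raw count of a term adds a constant to that factor instead of doubling the weight.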

Best Answer

I was also looking for a reference to justify my use of sublinear tf. I couldn't find where it originated, but if you just need a reference, you can cite An Introduction to Information Retrieval (C.D. Manning et al., 2009). Section 6.4 covers variant tf-idf functions, including sublinear tf scaling.