Text Mining – Understanding Incremental IDF (Inverse Document Frequency)

text mining, time series

In a text mining application, one simple approach is to use the $\mathrm{tf}$-$\mathrm{idf}$ heuristic to create vectors as compact sparse representations of the documents. This is fine for the batch setting, where the whole corpus is known a priori, since the $\mathrm{idf}$ requires the whole corpus:

$$
\mathrm{idf}(t) = \log \frac{|D|}{|\{d: t \in d\}|}
$$

where $t$ is a term, $d$ is a document, $D$ is the document corpus, and $T$ (not shown) is the dictionary.
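
For concreteness, here is a minimal sketch of the batch computation (Python; the function name and the set-of-terms document representation are my own, not standard):

```python
import math
from collections import Counter

def batch_idf(corpus):
    """idf(t) = log(|D| / |{d : t in d}|) for every term in the corpus."""
    doc_freq = Counter()            # |{d : t in d}| for each term t
    n_docs = 0                      # |D|
    for doc in corpus:
        n_docs += 1
        doc_freq.update(set(doc))   # count each term once per document
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}

idf = batch_idf([{"cat", "dog"}, {"dog", "fish"}, {"cat"}])
# idf["fish"] == log(3/1); idf["dog"] == log(3/2)
```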

However, new documents typically arrive over time. One option is to keep using the existing $\mathrm{idf}$ until a certain number of new documents have been received, and then recalculate it, but this seems rather inefficient. Does anyone know of an incremental update scheme that (possibly approximately) converges to the value that would be obtained if all the data were seen in advance? Or, alternatively, is there another measure that captures the same notion but can be computed incrementally?

There is also a related question of whether the $\mathrm{idf}$ remains a good measure over time. Since the idf captures corpus-level word frequencies, older documents in the corpus may no longer reflect current usage (say, for example, my corpus includes over 100 years of journal articles), as the frequencies of different words change over time. In this case it might actually be sensible to throw out older documents as new ones come in, in effect using a sliding-window $\mathrm{idf}$. Conceivably, one could also store all previous $\mathrm{idf}$ vectors as new ones are calculated, and then, if we wanted to retrieve documents from, say, 1920-1930, we could use the $\mathrm{idf}$ calculated from documents in that date range. Does this approach make sense?

Edit: There is a separate but related issue concerning the dictionary $T$. As time goes on, new terms will appear that were not seen before, so $|T|$ will need to grow, and hence so will the length of the $\mathrm{idf}$ vector. This doesn't seem to be a problem, as zeros can simply be appended to old $\mathrm{idf}$ vectors.
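
For illustration, padding an old $\mathrm{idf}$ vector when the dictionary grows is a one-liner (a hypothetical example with made-up values):

```python
import numpy as np

old_idf = np.array([1.10, 0.41, 0.92, 1.61])   # idf over a 4-term dictionary
new_dict_size = 6                              # dictionary has since grown
padded = np.pad(old_idf, (0, new_dict_size - len(old_idf)))  # pads with zeros
# padded == [1.10, 0.41, 0.92, 1.61, 0.0, 0.0]
```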

Best Answer

OK, thanks to Steffen for the useful comments. I guess the answer is quite simple in the end. As he says, all we need to do is store the current denominator (call it $z$):

$z(t) = |\{d:t\in d\}|$

Now given a new document $d^*$, we update the denominator simply by:

$z^*(t) = z(t) + \begin{cases} 1 & \text{if } t \in d^* \\ 0 & \text{otherwise} \end{cases}$

We would then have to recalculate the $\mathrm{tf}$-$\mathrm{idf}$ scores based on the new $\mathrm{idf}$ vector. (The numerator $|D|$ also increases by one with each new document, but that is just a running count and is trivial to maintain.)

Similarly, to remove an old document we decrement $z(t)$ for each term $t$ in that document, and decrement the document count $|D|$, in the same fashion.
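
Putting the pieces together, here is a minimal sketch of this bookkeeping (Python; the class name and the set-of-terms representation are my own assumptions):

```python
import math
from collections import Counter

class IncrementalIdf:
    """Maintain z(t) = |{d : t in d}| and |D| as documents come and go."""

    def __init__(self):
        self.z = Counter()   # per-term document frequencies (the denominators)
        self.n_docs = 0      # |D|

    def add(self, doc):
        """doc is the set of distinct terms in the new document d*."""
        self.n_docs += 1
        self.z.update(doc)            # z(t) += 1 for each t in d*

    def remove(self, doc):
        self.n_docs -= 1
        self.z.subtract(doc)          # z(t) -= 1 for each t in the old document
        self.z += Counter()           # drop terms whose count reached zero

    def idf(self, t):
        """Only defined for terms with z(t) > 0."""
        return math.log(self.n_docs / self.z[t])

inc = IncrementalIdf()
inc.add({"cat", "dog"})
inc.add({"dog", "fish"})
inc.idf("fish")                       # log(2/1)
```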

This does mean that we either have to store the entire $\mathrm{tf}$ matrix as well as the $\mathrm{tf}$-$\mathrm{idf}$ matrix (roughly doubling the memory requirements), or we have to compute the $\mathrm{tf}$-$\mathrm{idf}$ scores when needed (increasing the computational cost). I can't see any way round that.
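
As an illustration of the second option: if the $\mathrm{tf}$ matrix is kept in a sparse format, recomputing the scores on demand is a single broadcast multiply (a sketch using scipy; the shapes and values here are hypothetical):

```python
import numpy as np
from scipy import sparse

# 2 documents x 3 terms: tf is the stored sparse term-frequency matrix,
# idf the incrementally maintained vector.
tf = sparse.csr_matrix([[2.0, 1.0, 0.0],
                        [0.0, 3.0, 1.0]])
idf = np.array([np.log(2 / 1), np.log(2 / 2), np.log(2 / 1)])

tfidf = tf.multiply(idf).tocsr()   # broadcasts idf across the rows on demand
```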

For the second part of the question, about the evolution of $\mathrm{idf}$ vectors over time, it seems we can use the above method and store a set of "landmark" $z$ vectors (denominators) for different date ranges (or perhaps content subsets). Of course, $z$ is a dense vector of the length of the dictionary, so storing many of these will be memory-intensive; however, this is probably preferable to recomputing $\mathrm{idf}$ vectors when needed (which would again require storing the $\mathrm{tf}$ matrix as well, or instead).
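
One way to realize the landmark idea (a sketch under my own assumptions, not from the thread): since $z$ and $|D|$ are sums over documents, a cumulative snapshot taken at the end of each period lets us recover the $\mathrm{idf}$ for any span between two landmarks by subtraction:

```python
import math
from collections import Counter

class LandmarkIdf:
    def __init__(self):
        self.z = Counter()       # running document frequencies
        self.n_docs = 0          # running document count
        self.snapshots = {}      # label -> (n_docs, copy of z) at that point

    def add(self, doc):
        self.n_docs += 1
        self.z.update(doc)       # doc is a set of distinct terms

    def snapshot(self, label):
        """Record a landmark, e.g. at the end of each decade."""
        self.snapshots[label] = (self.n_docs, self.z.copy())

    def idf_between(self, start, end, t):
        """idf of t over just the documents added between two landmarks.

        Only defined for terms that appear at least once in that span.
        """
        n0, z0 = self.snapshots[start]
        n1, z1 = self.snapshots[end]
        return math.log((n1 - n0) / (z1[t] - z0[t]))
```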