Solved – How to continuously compute tf-idf for relevance of single terms

Tags: text mining, tf-idf

I have a document corpus containing over 4 million documents. I want to build an index over the terms in these documents and, based on their tf-idf scores, remove the least important terms every 10,000 documents or so. Since tf-idf is a document-level measure that nevertheless depends on the whole corpus, I'm not sure how to update it continuously. So far I have been computing it with this formula:

tf-idf_continuous = (current_tf-idf * (currentNumberOfArticlesContaining_i - 1) + tf_ij * log(N)) / currentNumberOfArticlesContaining_i;

where i is the term, tf_ij is the term frequency of i in document j, and N is the number of documents in the corpus. In effect I'm maintaining a running mean of tf-idf per term. Judging by the results I get, though, I don't think this is a good approach. At the same time, I don't have enough computing power to build the whole index first and only then compute tf-idf for all instances.
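For reference, the running-mean update described above can be sketched as follows (the function and argument names are illustrative, not from any library):

```python
import math

def update_continuous_tfidf(current_tfidf, tf_ij, df_i, n_docs):
    """Running-mean update from the question: fold the new document's
    tf_ij * log(N) into the existing mean over the df_i documents
    that contain term i (df_i = currentNumberOfArticlesContaining_i)."""
    return (current_tfidf * (df_i - 1) + tf_ij * math.log(n_docs)) / df_i
```

Note that log(N) here is not the usual idf (log(N/df)), which is one reason this running mean drifts away from a true tf-idf score.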

Best Answer

I would propose the following procedure. For each chunk of 10k documents:

  1. Calculate word frequencies for each text
  2. If the corpus document-frequency (df) table does not exist yet, initialize it from the word frequencies of this chunk. Otherwise, update it with the counts from the chunk, plus whatever transforms are necessary. New words can be handled by adding "zero" columns to the old chunks.
  3. Recalculate tf-idf for all processed chunks by multiplying each term's tf by the updated idf (log(N/df)).
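The steps above can be sketched roughly as follows for a single chunk pass, assuming the documents arrive already tokenized (names are illustrative; recomputing earlier chunks would reuse the same df table):

```python
import math
from collections import Counter

def process_chunk(docs, df, n_docs):
    """One chunk pass: per-document term frequencies (step 1),
    document-frequency update (step 2), then tf-idf for the chunk
    with the current idf values (step 3).

    docs:   list of token lists
    df:     Counter mapping term -> number of documents containing it
    n_docs: number of documents processed so far
    """
    tfs = [Counter(doc) for doc in docs]          # step 1: per-doc tf
    for tf in tfs:                                # step 2: update df
        for term in tf:
            df[term] += 1
    n_docs += len(docs)
    # step 3: tf-idf with idf = log(N / df); terms absent from a
    # document implicitly have tf = 0 (the "zero columns")
    tfidf = [{t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
             for tf in tfs]
    return tfidf, df, n_docs
```

After each chunk, the least important terms can be trimmed by ranking on these scores before the next chunk is processed.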

Does that work for you? Normal considerations about trimming sparse words, etc. apply.