Solved – Which weighting factor to use for text categorization

data mining · machine learning · text mining · tf-idf

I am working on a text categorization task. I have 21,000 documents for training and (for the time being) 7,000 documents for testing. I construct the doc-term matrix for both the training corpus and the testing corpus with two different weighting factors: TF (term frequency) or TF-IDF (term frequency–inverse document frequency). Then I use an SVM with a Gaussian radial kernel to classify the documents. The F1 measure with tf-idf weighting is nearly 0.8, while with tf weighting the performance is worse, around 0.7. So, logically, we would prefer tf-idf weighting.
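For concreteness, here is a minimal sketch of that comparison using scikit-learn (an assumption on my side; the question does not say which tools were used). The toy corpus below just stands in for the 21,000/7,000 split, so the scores are not meaningful, only the wiring is:

```python
# Compare tf vs tf-idf weighting with an RBF-kernel SVM,
# as described in the question (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Tiny stand-in corpora for illustration only.
train_docs = ["the cat sat on the mat", "dogs chase cats in the yard",
              "stock prices rose sharply", "markets fell on bad news"]
train_labels = [0, 0, 1, 1]
test_docs = ["a cat chased a dog", "prices fell in the markets"]
test_labels = [0, 1]

def evaluate(vectorizer):
    # Fit the doc-term matrix on the training corpus only, then
    # apply the same vocabulary/weights to the test corpus.
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)
    clf = SVC(kernel="rbf", gamma="scale")  # Gaussian radial kernel
    clf.fit(X_train, train_labels)
    return f1_score(test_labels, clf.predict(X_test))

f1_tf = evaluate(CountVectorizer())     # raw term frequencies
f1_tfidf = evaluate(TfidfVectorizer())  # tf-idf weighting
print(f1_tf, f1_tfidf)
```

Note that `vectorizer.transform` (not `fit_transform`) is used on the test set, so the test documents are weighted with statistics learned from training only.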

However, a problem arises in an INCREMENTAL CONTEXT, that is, when we have to categorize a single new document, or a few of them, from time to time with a pre-trained model. It does not seem suitable to use tf-idf weighting for one single document, since tf-idf is meant to measure a word's importance within a collection of documents.

Should I compromise and use tf weighting, or do other tricks exist?

Best Answer

Have you thought about storing the term frequencies and the document frequencies separately for the training set? Then, when you add a new document, you can update the document frequencies of the new training set (i.e., including the new document) and calculate tf-idf from the updated counts. Or am I missing something?
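The idea above can be sketched as follows; this is a minimal illustration under my own assumptions (a hypothetical pre-trained corpus of `n_docs` documents, and the plain `tf * log(N/df)` weighting), not the only way to smooth or normalize tf-idf:

```python
# Keep per-term document frequencies from training, bump them when a
# new document arrives, then weight the new document with updated idf.
import math
from collections import Counter

# Document frequencies from a hypothetical trained corpus.
df = Counter({"cat": 120, "dog": 90, "stock": 40})
n_docs = 1000  # number of documents seen so far

def tfidf_for_new_doc(tokens):
    """Add one new document to the collection statistics and
    return its tf-idf weights under the updated counts."""
    global n_docs
    n_docs += 1
    for term in set(tokens):  # each document counts once toward df
        df[term] += 1
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

weights = tfidf_for_new_doc(["cat", "cat", "stock"])
print(weights)
```

Here the rarer term ("stock") gets a higher idf than the common one ("cat"), exactly the collection-level signal that a single document alone cannot provide.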