Solved – Which weighting factor to use for text categorization

data mining · machine learning · text mining · tf-idf

I am working on a text categorization task. I have 21,000 documents for training and (for the time being) 7,000 documents for testing. I construct the doc-term matrix for both the training corpus and the testing corpus with two different weighting factors: TF (term frequency) or TF-IDF (term frequency–inverse document frequency). Then I use an SVM with a Gaussian radial kernel to classify the documents. The F1 measure with tf-idf weighting is nearly 0.8, while with tf weighting the performance is worse, around 0.7. So, logically, we would prefer tf-idf weighting.
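For concreteness, here is a minimal sketch of that comparison using scikit-learn (an assumption on my side; the question does not say which tools were used). The toy corpus below just stands in for the 21,000/7,000 split, so the scores are not meaningful, only the wiring is:

```python
# Compare tf vs tf-idf weighting with an RBF-kernel SVM,
# as described in the question (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Tiny stand-in corpora for illustration only.
train_docs = ["the cat sat on the mat", "dogs chase cats in the yard",
              "stock prices rose sharply", "markets fell on bad news"]
train_labels = [0, 0, 1, 1]
test_docs = ["a cat chased a dog", "prices fell in the markets"]
test_labels = [0, 1]

def evaluate(vectorizer):
    # Fit the doc-term matrix on the training corpus only, then
    # apply the same vocabulary/weights to the test corpus.
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)
    clf = SVC(kernel="rbf", gamma="scale")  # Gaussian radial kernel
    clf.fit(X_train, train_labels)
    return f1_score(test_labels, clf.predict(X_test))

f1_tf = evaluate(CountVectorizer())     # raw term frequencies
f1_tfidf = evaluate(TfidfVectorizer())  # tf-idf weighting
print(f1_tf, f1_tfidf)
```

Note that `vectorizer.transform` (not `fit_transform`) is used on the test set, so the test documents are weighted with statistics learned from training only.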

However, a problem arises in an INCREMENTAL CONTEXT, that is, when we have to categorize a single new document, or a few of them, from time to time with a pre-trained model. It does not seem suitable to use tf-idf weighting for one single document, since tf-idf is meant to measure a word's importance within a collection of documents.

Should I compromise and use tf weighting, or do other tricks exist?

Best Answer

Have you thought about storing the term frequencies and the document frequencies separately for the training set? Then, when you add a new document, you can update the document frequencies of the new training set (i.e., including the new document) and calculate tf-idf from the updated counts. Or am I missing something?
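The idea above can be sketched as follows; this is a minimal illustration under my own assumptions (a hypothetical pre-trained corpus of `n_docs` documents, and the plain `tf * log(N/df)` weighting), not the only way to smooth or normalize tf-idf:

```python
# Keep per-term document frequencies from training, bump them when a
# new document arrives, then weight the new document with updated idf.
import math
from collections import Counter

# Document frequencies from a hypothetical trained corpus.
df = Counter({"cat": 120, "dog": 90, "stock": 40})
n_docs = 1000  # number of documents seen so far

def tfidf_for_new_doc(tokens):
    """Add one new document to the collection statistics and
    return its tf-idf weights under the updated counts."""
    global n_docs
    n_docs += 1
    for term in set(tokens):  # each document counts once toward df
        df[term] += 1
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

weights = tfidf_for_new_doc(["cat", "cat", "stock"])
print(weights)
```

Here the rarer term ("stock") gets a higher idf than the common one ("cat"), exactly the collection-level signal that a single document alone cannot provide.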