Solved – tf-idf in multi-label classification task

classification, machine learning

I have a question regarding the application of tf-idf.

Let's assume I have a document classification task with a multi-label training set, so one document can have multiple labels. I use unigrams and bigrams as features and tf-idf as the feature values.

The question is how to calculate the tf-idf values.

For example, I have document 1, which is classified as both class1 and class2.

Should tf be just the frequency of feature f in document 1, the frequency of feature f in all documents of class1 and class2, or the frequency of feature f in the entire corpus?

The same question applies to idf. Should I consider the class of the document when calculating it?

Best Answer

No, you don't have to consider the document class when calculating IDF. TF.IDF can be used completely independently of the class labels.

In other words, the TF term is just the frequency of a particular unigram/bigram in the document itself. The IDF is the inverse of the document frequency, i.e. of the fraction of documents in the whole corpus that contain that unigram/bigram (usually log-scaled).
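To make this concrete, here is a minimal pure-Python sketch of label-independent tf-idf over unigrams and bigrams. The toy corpus, the function names `ngrams` and `tfidf`, whitespace tokenization, and the unsmoothed `log(N / df)` idf variant are all assumptions chosen for illustration; libraries typically use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all unigrams and bigrams (as tuples) of one tokenized document."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs):
    """Compute tf-idf weights per document; class labels are never consulted."""
    # tf: raw count of each unigram/bigram within a single document.
    features = [Counter(ngrams(doc.lower().split())) for doc in docs]
    n_docs = len(docs)
    # df: number of documents in the whole corpus that contain the feature.
    df = Counter(f for counts in features for f in counts)
    # idf: one common variant, log(N / df). This choice is an assumption.
    idf = {f: math.log(n_docs / d) for f, d in df.items()}
    return [{f: tf * idf[f] for f, tf in counts.items()} for counts in features]


# Hypothetical toy corpus; labels are kept only to show that they go unused.
corpus = [
    ("the cat sat on the mat", {"class1", "class2"}),
    ("the dog sat", {"class1"}),
    ("a bird flew", {"class3"}),
]
weights = tfidf([text for text, _labels in corpus])
```

The multi-label document gets exactly one feature vector, and the same vector is used no matter which of its labels a classifier is being trained for. In practice the idf values should be computed on the training set only and then reused for unseen documents.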