Solved – How to use TF-IDF for feature selection in text classification

text mining

I have a small confusion regarding TF-IDF. I am planning to use TF-IDF to create a better word dictionary for a Naive Bayes classifier. I am calculating the TF-IDF of all words within each class to find the importance of a given word in that class; in my case the classes are subjective and objective. Based on a TF-IDF cutoff, I plan to build a better word dictionary.

Here is where my confusion arises: should I use TF-IDF in this way, or should I use it to determine which class a word belongs to?
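The per-class scoring described above could be sketched like this (a minimal stdlib sketch, not a definitive implementation; the class names and the cutoff value are assumptions for illustration):

```python
import math
from collections import Counter

def per_class_tfidf(class_docs):
    """class_docs: dict mapping class name -> list of token lists.

    Treats each class as one pooled document and scores each word by
    TF-IDF within that class, so high-scoring words are frequent in
    the class but rare in the other classes.
    """
    # Term frequency per class (all of a class's documents pooled together)
    tf = {c: Counter(tok for doc in docs for tok in doc)
          for c, docs in class_docs.items()}
    n_classes = len(class_docs)
    scores = {}
    for c, counts in tf.items():
        total = sum(counts.values())
        scores[c] = {}
        for word, count in counts.items():
            # "Document" frequency here = number of classes containing the word
            df = sum(1 for other in tf.values() if word in other)
            idf = math.log(n_classes / df) + 1.0  # +1 so shared words keep a small weight
            scores[c][word] = (count / total) * idf
    return scores

def dictionary_for(scores, cls, cutoff):
    """Keep only words whose within-class TF-IDF meets the cutoff."""
    return {w for w, s in scores[cls].items() if s >= cutoff}

# Toy example with the two classes from the question
scores = per_class_tfidf({
    "subjective": [["great", "movie"], ["great", "fun"]],
    "objective": [["movie", "runtime"], ["runtime", "cast"]],
})
```

With only two classes the IDF term is coarse (a word is either in one class or both), so the cutoff does most of the work.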

Best Answer

Unfortunately, there is no set answer - you have to try what works (start with whatever's easiest) for your given problem. What works can also vary by topic.

My favorite example of this is Joachims (1998).

That paper compares algorithms averaged across several feature-selection approaches, but my point is that if you look at Figure 2, Naive Bayes works really well for some topics and really poorly for others.

I generally start by taking the top 20% of terms by TF-IDF across all classes, and use Naive Bayes to get a performance baseline for each class. This is quick, and often all I need in my domain (insurance). Then you may want to dig deeper into any classes that perform poorly - like you said, maybe do TF-IDF within the class and look at the terms you can leverage.
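The "top 20% by TF-IDF" step might look like this (a hedged sketch, stdlib only; ranking each word by its best TF-IDF score in any document is one reasonable choice, not the only one):

```python
import math
from collections import Counter

def top_tfidf_vocab(docs, keep=0.2):
    """docs: list of token lists (each document already tokenized).

    Ranks every word by its highest TF-IDF score in any single document
    and returns the top `keep` fraction of the vocabulary.
    """
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    best = {}
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        for w, c in counts.items():
            score = (c / total) * (math.log(n / df[w]) + 1.0)
            best[w] = max(best.get(w, 0.0), score)
    ranked = sorted(best, key=best.get, reverse=True)
    k = max(1, int(len(ranked) * keep))
    return set(ranked[:k])

def restrict(doc, vocab):
    """Drop tokens outside the selected vocabulary before training."""
    return [w for w in doc if w in vocab]
```

You would then train the Naive Bayes baseline on the restricted documents and compare per-class precision/recall to decide where to dig deeper.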

Looking at the terms by class can really help. One time I noticed medical terms were important in one particular class, so I downloaded a list of medical terms, turned it into a regular expression, and used it to set a flag on the documents, which really improved classification for that class.
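A term-list flag of that kind can be sketched as follows (the term list here is hypothetical; in practice you would load the downloaded vocabulary):

```python
import re

# Hypothetical medical term list - substitute the real downloaded vocabulary.
medical_terms = ["fracture", "hypertension", "lesion", "biopsy"]

# One alternation with word boundaries; compile once, reuse for every document.
medical_re = re.compile(
    r"\b(?:" + "|".join(map(re.escape, medical_terms)) + r")\b",
    re.IGNORECASE,
)

def flag_medical(doc):
    """1 if the document mentions any listed term, else 0 -
    used as an extra binary feature alongside the word features."""
    return 1 if medical_re.search(doc) else 0
```

`re.escape` keeps multi-word or punctuated terms from being misread as regex syntax, and the single compiled alternation scales better than looping over terms per document.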

Obviously, this is very domain/topic specific, and it also depends on your domain expertise. That's the way it goes with text classification in my experience - there is no standard answer for what will work on any given problem. You may have to try several solutions, stopping when performance is adequate.