Solved – Feature selection for text mining

feature selection, machine learning, text mining

Before performing text mining, we need to select the features that characterize each document. Is there any systematic guidance on choosing document features? How does the length of a document affect the feature selection process?

Best Answer

There are a lot of ways to do feature selection. I'll give you my "system" that works in my domain (insurance).

The first thing I usually try is calculating TF-IDF and keeping the top-scoring 5-20% of terms, depending on how many features there are. To be honest, I don't worry too much about document length at this stage (or, more specifically, about the difference in lengths between documents). I feel the TF portion of TF-IDF accounts for that, and IDF essentially creates a "corpus-specific" stopword list. But I'm also not comparing tweets to, say, War and Peace.
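To make the TF-IDF step concrete, here is a minimal sketch in plain Python. It rests on a few assumptions that are not in the answer above: TF is the term count divided by document length (which is what lets TF absorb length differences), IDF is log(N / df), and terms are ranked by their TF-IDF summed over all documents. The function names and the tiny example corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Length-normalized TF times IDF (assumed variant: log(N / df)).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def select_top_terms(docs, fraction=0.2):
    """Keep the top `fraction` of the vocabulary, ranked by summed TF-IDF."""
    totals = defaultdict(float)
    for doc_scores in tfidf_scores(docs):
        for term, score in doc_scores.items():
            totals[term] += score
    keep = max(1, round(fraction * len(totals)))
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    return ranked[:keep]


# Hypothetical toy corpus, already tokenized.
docs = [
    "the claim was filed after the flood".split(),
    "the claim was denied after the fire".split(),
    "the policy covers flood damage".split(),
]
selected = select_top_terms(docs, fraction=0.2)
print(selected)
```

Note that "the" occurs in every document, so its IDF is log(1) = 0 and it can never be selected. That is the "corpus-specific stopword list" effect in action.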

I then take these features and use Naive Bayes to classify. I consider this to be my "baseline", but it's often good enough, especially since I frequently only need to quantify text for combination with more traditional structured data.
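A baseline of this kind could look roughly like the sketch below. It assumes a multinomial Naive Bayes with add-one (Laplace) smoothing, restricted to an already selected vocabulary; the answer does not say which Naive Bayes variant is used, so treat that as an illustration choice. The class name, vocabulary, documents and labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over a fixed, pre-selected vocabulary."""

    def __init__(self, vocabulary, alpha=1.0):
        self.vocabulary = set(vocabulary)
        self.alpha = alpha  # smoothing constant (1.0 = Laplace smoothing)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            # Count only the terms that survived feature selection.
            self.term_counts[label].update(
                term for term in doc if term in self.vocabulary
            )
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocabulary)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_label / n_docs)
            for term in doc:
                if term in self.vocabulary:
                    score += math.log((counts[term] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical selected features and labeled training documents.
vocabulary = ["flood", "water", "fire", "smoke"]
train_docs = [
    "water damage from the flood".split(),
    "flood water in the basement".split(),
    "smoke damage from the fire".split(),
    "kitchen fire and smoke".split(),
]
train_labels = ["flood", "flood", "fire", "fire"]

model = NaiveBayes(vocabulary).fit(train_docs, train_labels)
prediction = model.predict("smoke in the kitchen".split())
print(prediction)
```

Words outside the selected vocabulary ("in", "the", "kitchen") are simply ignored at prediction time, so the classifier only ever sees the features chosen in the previous step.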

Hope this is enough to get you started.