The Introduction to Information Retrieval book contains some relevant material.
If Python is your cup of tea (and you have a moderate amount of data), then this deck might be helpful. Basically, one can train nltk's naive Bayes classifier, which, among other things, lets you pick the top N most informative features (you could then restrict the feature set to, say, the top 1,000 or top 10,000 features; ideally this threshold should be tuned on a holdout sample or via cross-validation):
```
>>> help(nltk.classify.NaiveBayesClassifier.most_informative_features)
Help on method most_informative_features in module nltk.classify.naivebayes:

most_informative_features(self, n=100) unbound nltk.classify.naivebayes.NaiveBayesClassifier method
    Return a list of the 'most informative' features used by this
    classifier.  For the purpose of this function, the
    informativeness of a feature C{(fname,fval)} is equal to the
    highest value of P(fname=fval|label), for any label, divided by
    the lowest value of P(fname=fval|label), for any label:

        max[ P(fname=fval|label1) / P(fname=fval|label2) ]
```
In addition to unigram/bag-of-words features, one could try adding significant bigrams to the feature list (the deck has some examples). nltk provides multiple ways to calculate the significance of collocations (including the chi-squared test).
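For instance, a quick sketch of nltk's collocation machinery on a made-up word stream, ranking bigrams by the chi-squared measure:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy token stream (made up); in practice this would be your tokenized corpus.
words = ("new york is a big city and "
         "new york never sleeps because new york is busy").split()

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)  # ignore bigrams seen only once

# Rank the remaining bigrams by the chi-squared association measure.
top = finder.nbest(BigramAssocMeasures.chi_sq, 5)
print(top)
```

The highest-scoring bigrams can then be added to the feature dictionaries alongside the unigrams.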
Another popular approach is to apply tf-idf to all features (without any feature selection) and use regularization (L1 and/or L2) to deal with irrelevant features (the SVM example from the deck corresponds to L2 regularization). The drawback is that the regularization coefficient has to be tuned on a holdout set or via cross-validation.
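A hedged sketch of that approach with scikit-learn (the toy corpus and labels are made up): tf-idf every term, fit an L1-regularized logistic regression instead of an SVM, and tune the regularization strength C by cross-validation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Made-up toy corpus; real data would be your documents and labels.
docs = [
    "cheap pills buy now", "cheap pills limited offer",
    "win money now cheap", "free money cheap pills",
    "meeting schedule tomorrow", "project meeting notes",
    "team lunch tomorrow", "notes from project review",
]
labels = ["spam"] * 4 + ["ham"] * 4

pipe = make_pipeline(
    TfidfVectorizer(),
    # L1 penalty drives the weights of irrelevant terms to exactly zero.
    LogisticRegression(penalty="l1", solver="liblinear"),
)

# The regularization coefficient C is the knob that has to be tuned
# on held-out data or via cross-validation.
grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=2)
grid.fit(docs, labels)
print(grid.best_params_, grid.predict(["cheap pills"])[0])
```

Swapping `LogisticRegression` for `LinearSVC` (the deck's L2 case) leaves the rest of the pipeline unchanged.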
Best Answer
There are a lot of ways to do feature selection. I'll give you my "system" that works in my domain (insurance).
The first thing I usually try is calculating TF-IDF and taking the top 5-20% of features, depending on how many there are. To be honest, I don't worry too much about document length at this stage (or, more specifically, the difference in lengths between documents). I feel the TF portion of TF-IDF accounts for that, and IDF essentially creates a corpus-specific stopword list. But then, I'm not comparing tweets to, say, War and Peace either.
I then take these features and use Naive Bayes to classify. I consider this my "baseline", but it's often good enough, especially since I frequently only need to quantify text for combination with more traditional structured data.
Hope this is enough to get you started.