The Introduction to Information Retrieval book contains some relevant material.
If Python is your cup of tea (and you have a moderate amount of data), then this deck might be helpful. Basically, one can train nltk's naive Bayes classifier, which, among other things, lets you pick the top N most informative features (you could then restrict the feature set to, say, the top 1,000 or top 10,000 features; ideally this threshold should be tuned on a holdout sample or via cross-validation):
```
>>> help(nltk.classify.NaiveBayesClassifier.most_informative_features)
Help on method most_informative_features in module nltk.classify.naivebayes:

most_informative_features(self, n=100) unbound nltk.classify.naivebayes.NaiveBayesClassifier method
    Return a list of the 'most informative' features used by this
    classifier.  For the purpose of this function, the
    informativeness of a feature C{(fname,fval)} is equal to the
    highest value of P(fname=fval|label), for any label, divided by
    the lowest value of P(fname=fval|label), for any label:

        max[ P(fname=fval|label1) / P(fname=fval|label2) ]
```
In addition to unigram/bag-of-words features, one could try adding significant bigrams to the feature list (the deck has some examples). nltk provides multiple ways to calculate the significance of collocations (including the chi-squared test).
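For instance, a quick sketch of nltk's collocation machinery on a made-up word stream, ranking bigrams by the chi-squared measure:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy token stream (made up); in practice this would be your tokenized corpus.
words = ("new york is a big city and "
         "new york never sleeps because new york is busy").split()

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)  # ignore bigrams seen only once

# Rank the remaining bigrams by the chi-squared association measure.
top = finder.nbest(BigramAssocMeasures.chi_sq, 5)
print(top)
```

The highest-scoring bigrams can then be added to the feature dictionaries alongside the unigrams.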
Another popular approach is to apply tf-idf to all features (without any feature selection) and use regularization (L1 and/or L2) to deal with irrelevant features (the SVM example from the deck corresponds to L2 regularization). The drawback is that the regularization coefficient has to be tuned on a holdout set or via cross-validation.
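A hedged sketch of that approach with scikit-learn (the toy corpus and labels are made up): tf-idf every term, fit an L1-regularized logistic regression instead of an SVM, and tune the regularization strength C by cross-validation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Made-up toy corpus; real data would be your documents and labels.
docs = [
    "cheap pills buy now", "cheap pills limited offer",
    "win money now cheap", "free money cheap pills",
    "meeting schedule tomorrow", "project meeting notes",
    "team lunch tomorrow", "notes from project review",
]
labels = ["spam"] * 4 + ["ham"] * 4

pipe = make_pipeline(
    TfidfVectorizer(),
    # L1 penalty drives the weights of irrelevant terms to exactly zero.
    LogisticRegression(penalty="l1", solver="liblinear"),
)

# The regularization coefficient C is the knob that has to be tuned
# on held-out data or via cross-validation.
grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=2)
grid.fit(docs, labels)
print(grid.best_params_, grid.predict(["cheap pills"])[0])
```

Swapping `LogisticRegression` for `LinearSVC` (the deck's L2 case) leaves the rest of the pipeline unchanged.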
Best Answer
There are a lot of ways to do feature selection. I'll give you my "system" that works in my domain (insurance).
The first thing I usually try is calculating TF-IDF and taking the top 5-20% of features, depending on how many there are. To be honest, I don't worry too much about document length at this stage (or, more specifically, the difference in lengths between documents). I feel the TF portion of TF-IDF accounts for that, and IDF essentially creates a corpus-specific stopword list. But then, I'm not comparing tweets to, say, War and Peace either.
I then take these features and use Naive Bayes to classify. I consider this my "baseline", but it's often good enough, especially since I frequently only need to quantify text for combination with more traditional structured data.
Hope this is enough to get you started.