The book Introduction to Information Retrieval contains some relevant material.
If Python is your cup of tea (and you have a moderate amount of data), then this deck might be helpful. Basically, one can train nltk's naive Bayes classifier, which, among other things, lets you pick the top N most informative features. One could then restrict the feature set to, say, the top 1,000 or top 10,000 features; ideally this threshold should be tuned on a holdout sample or with cross-validation:
    >>> help(nltk.classify.NaiveBayesClassifier.most_informative_features)
    Help on method most_informative_features in module nltk.classify.naivebayes:

    most_informative_features(self, n=100) unbound nltk.classify.naivebayes.NaiveBayesClassifier method
        Return a list of the 'most informative' features used by this
        classifier. For the purpose of this function, the
        informativeness of a feature C{(fname,fval)} is equal to the
        highest value of P(fname=fval|label), for any label, divided by
        the lowest value of P(fname=fval|label), for any label::

          max[ P(fname=fval|label1) / P(fname=fval|label2) ]
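To make the ratio in that docstring concrete, here is a minimal standard-library sketch that computes it directly from labeled feature dictionaries. It is not nltk's implementation: the add-alpha smoothing, the `featurize` helper, the `VOCAB` list and the toy messages are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def most_informative_features(labeled_featuresets, n=100, alpha=0.5):
    """Rank (fname, fval) pairs by max P(fname=fval|label) / min P(fname=fval|label).

    alpha is an add-alpha smoothing constant (an assumption of this sketch)
    that keeps every probability non-zero, so the ratio is always defined.
    """
    label_counts = Counter()
    pair_counts = defaultdict(Counter)  # (fname, fval) -> label -> count
    values_seen = defaultdict(set)      # fname -> set of observed values

    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            pair_counts[(fname, fval)][label] += 1
            values_seen[fname].add(fval)

    scored = []
    for (fname, fval), per_label in pair_counts.items():
        bins = len(values_seen[fname])
        # Smoothed estimate of P(fname=fval | label) for every label.
        probs = [
            (per_label[label] + alpha) / (label_counts[label] + alpha * bins)
            for label in label_counts
        ]
        scored.append((max(probs) / min(probs), (fname, fval)))

    # Highest ratio first; ties keep their first-seen order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:n]]


# Hypothetical toy data: bag-of-words "contains(word)" features.
VOCAB = ["free", "meeting", "hello"]


def featurize(text):
    words = set(text.split())
    return {f"contains({w})": (w in words) for w in VOCAB}


spam = ["free prize hello", "free offer", "free money hello", "free deal"]
ham = ["meeting hello", "meeting notes", "meeting hello agenda", "lunch meeting free"]
train = [(featurize(t), "spam") for t in spam] + [(featurize(t), "ham") for t in ham]

top = most_informative_features(train, n=3)
print(top)
```

On this toy data the "meeting" features rank first, because they separate the two labels perfectly, while "hello" appears equally often under both labels and would be the first feature dropped by a top-N cutoff.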
In addition to unigram/bag-of-words features, one could try adding significant bigrams to the feature list (the deck has some examples). nltk provides multiple ways to calculate significance for collocations (including chi-squared).
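As a rough illustration of what chi-squared significance means for a bigram, the sketch below scores each adjacent word pair with Pearson's chi-squared statistic on its 2x2 contingency table. It uses only the standard library and is not nltk's code; the function name, the `min_count` filter and the sample sentence are assumptions for the example.

```python
from collections import Counter


def chi_squared_bigrams(tokens, min_count=2):
    """Score bigrams by Pearson's chi-squared on a 2x2 contingency table."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())

    # How often each word occurs as the first / second element of a bigram.
    first = Counter()
    second = Counter()
    for (w1, w2), count in bigrams.items():
        first[w1] += count
        second[w2] += count

    scores = {}
    for (w1, w2), n11 in bigrams.items():
        if n11 < min_count:
            continue  # rare pairs give unreliable statistics
        n12 = first[w1] - n11          # w1 followed by something else
        n21 = second[w2] - n11         # something else followed by w2
        n22 = total - n11 - n12 - n21  # neither w1 first nor w2 second
        numerator = total * (n11 * n22 - n12 * n21) ** 2
        denominator = (n11 + n12) * (n11 + n21) * (n12 + n22) * (n21 + n22)
        scores[(w1, w2)] = numerator / denominator if denominator else 0.0

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample text.
text = "new york is big and new york is old but the city is big so the town is old"
ranked = chi_squared_bigrams(text.split())
for bigram, score in ranked:
    print(bigram, round(score, 3))
```

Here ("new", "york") scores highest because the two words only ever occur together, whereas pairs such as ("york", "is") share "is" with other contexts. The top-scoring bigrams can then be added as extra features next to the unigrams.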
Another popular approach is to apply tf-idf to all features first (without any feature selection) and rely on regularization (L1 and/or L2) to deal with irrelevant features (the SVM example from the deck corresponds to L2 regularization). The drawback is that the regularization coefficient has to be tuned on a holdout data set or with cross-validation.
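A small sketch of the two ingredients, using only the standard library. The tf-idf variant (relative term frequency times log(N/df)) is one common choice among several, and the two shrinkage functions are single proximal steps rather than a full training loop; all names and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors


def l1_shrink(weight, lam):
    """Soft-thresholding: the shrinkage step induced by an L1 penalty lam * |w|."""
    return math.copysign(max(abs(weight) - lam, 0.0), weight)


def l2_shrink(weight, lam):
    """Shrinkage step induced by an L2 penalty (lam / 2) * w ** 2."""
    return weight / (1.0 + lam)


# Hypothetical toy corpus.
docs = [
    ["the", "cheap", "pills"],
    ["the", "meeting", "notes"],
    ["the", "cheap", "meeting"],
]
vectors = tfidf(docs)
print(vectors[0])  # "the" occurs in every document, so its weight is 0.0

# A small weight is set exactly to zero by L1 but only scaled down by L2.
print(l1_shrink(0.05, 0.1), l2_shrink(0.05, 0.1))
```

This is why L1 regularization acts as an implicit feature selector (weights of weak features become exactly zero), while L2 keeps every feature but with smaller weights. In both cases `lam` plays the role of the coefficient that has to be tuned.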
Best Answer
The problem of named entity resolution goes by multiple names, including deduplication and record linkage. I doubt that it is possible to determine precisely which software counts among the most popular for solving that problem. There are various approaches and algorithms that can be used for named entity resolution; therefore, software that implements them can be seen as complementary to each other (perhaps there exist multiple research studies that compare and benchmark entity resolution approaches and algorithms, but so far I have seen only two of them; see the references below, marked with a triple asterisk "***").
This nice tutorial (in the form of presentation slides) on entity resolution provides a comprehensive overview of the problem and its solutions, including both approaches and algorithms. The tutorial also provides an extensive set of references to sources with further information. As for corresponding software, one may find open source or dual-license projects, such as the Java-based Stanford NLP Group software (which includes the Stanford Named Entity Recognizer (NER)), the Stanford Entity Resolution Framework (SERF), LingPipe (which includes a NER module) and the Duke library, as well as the Python-based NLTK software (http://www.nltk.org/book/ch07.html). I realize that named entity recognition and resolution are quite different tasks; however, some of the above-referenced software, focused on the former, might be useful for the latter if the appropriate code segments are reused.
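To give a feel for the deduplication / record linkage task itself, here is a deliberately naive sketch that pairs up names whose normalized strings are similar. It uses only Python's standard library (`difflib.SequenceMatcher`) and does not reflect how any of the packages above work internally; the normalization rules, the 0.8 threshold and the sample names are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def similar_pairs(names, threshold=0.8):
    """Return (a, b, score) for every pair whose similarity reaches the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Hypothetical records that may refer to the same real-world entities.
names = [
    "International Business Machines Corp.",
    "international business machines corp",
    "Stanford University",
    "Stanford Univ.",
    "Duke Energy",
]
matches = similar_pairs(names)
for a, b, score in matches:
    print(f"{a!r} ~ {b!r} ({score:.2f})")
```

Comparing all pairs is quadratic in the number of records, so real entity resolution systems typically add a blocking step to limit the candidate pairs, and they combine several similarity signals rather than relying on a single string ratio.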
Additionally, the following IMHO related/relevant software and papers might also be of interest: