Solved – Popular named entity resolution software

machine learning, natural language, record-linkage, text mining

I am working on a project and need to extract persons' names from a large number of documents. I believe this task falls under the named entity resolution problem. What are currently some of the most popular open source software packages or libraries for named entity resolution?

Best Answer

The problem of named entity resolution goes by several names, including deduplication and record linkage. I doubt that it is possible to determine precisely which software is the most popular for solving that problem. Various approaches and algorithms can be used for named entity resolution, so software that implements them can be seen as complementary (there are probably multiple research studies that compare and benchmark entity resolution approaches and algorithms, but so far I have seen only two of them; see the references below marked with a triple asterisk "***").
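To make the deduplication / record linkage idea concrete, here is a minimal sketch that groups name mentions by string similarity. It is not the method of any of the libraries mentioned in this answer; the helper names (`normalize`, `similarity`, `resolve`), the greedy clustering, and the 0.85 threshold are all assumptions chosen for illustration, and only the Python standard library is used.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve(mentions, threshold=0.85):
    """Greedy single pass: attach each mention to the first cluster whose
    first member is similar enough, otherwise start a new cluster.
    The threshold is an illustrative assumption, not a recommended value."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if similarity(mention, cluster[0]) >= threshold:
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


mentions = ["John Smith", "john  smith", "Jon Smith", "Mary Jones", "Smith, John"]
clusters = resolve(mentions)
for cluster in clusters:
    print(cluster)
```

Note that "Smith, John" ends up in its own cluster here, because plain character similarity does not handle reordered name parts. Gaps like this are what the dedicated record linkage tools try to close with blocking, field-aware comparison, and learned match rules.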

This nice tutorial (in the form of presentation slides) on entity resolution provides a comprehensive overview of the problem and its solutions, covering both approaches and algorithms. The tutorial also provides an extensive set of references to sources with further information. As for corresponding software, one may find open source or dual-license projects, such as the Java-based Stanford NLP Group software (which includes the Stanford named entity recognizer (NER)), the Stanford Entity Resolution Framework (SERF), LingPipe (which includes a NER module) and the Duke library, as well as the Python-based NLTK software (http://www.nltk.org/book/ch07.html). I realize that named entity recognition and resolution are quite different tasks; however, some of the above-referenced software that focuses on the former might be useful for the latter if the appropriate code segments are reused.

Additionally, the following IMHO related/relevant software and papers might also be of interest:

Information Extraction framework in Python;
GATE software, in particular, ANNIE information extraction system;
several other NER tools, mentioned in this paper***;
this excellent overview*** of NER approaches, including neural networks and deep learning;
Ontotext's S4 (Self-Service Semantic Suite) on-demand software provides access to linked data repositories, such as DBpedia, Freebase and GeoNames;
Elasticsearch NER plug-in for Duke;
this paper on Swoosh algorithms, implemented by SERF software;
Wikilinks Corpus, released by Google;
this paper on entity disambiguation;
book "Data Matching" on record linkage, entity resolution, and duplicate detection.

Related Solutions

Solved – How to perform text mining, sentiment mining, and business category identification, and where to obtain a categorization library

One solution mentioned by Jeffrey Breen is to use Hu and Liu's opinion lexicon. He also gives a cool tutorial on sentiment mining of Twitter data.

Solved – Feature selection methods for document classification

The book Introduction to Information Retrieval contains some relevant material.

If Python is your cup of tea (and if you have a moderate amount of data), then this deck might be helpful. Basically, one can train NLTK's naive Bayes classifier, which, among other things, reports the top N most informative features (so one could then restrict the feature set to, say, the top 1000 or top 10000 features; ideally this threshold should be tuned on a holdout sample or using cross-validation):

>>> help(nltk.classify.NaiveBayesClassifier.most_informative_features)
Help on method most_informative_features in module nltk.classify.naivebayes:

most_informative_features(self, n=100) unbound nltk.classify.naivebayes.NaiveBayesClassifier method
    Return a list of the 'most informative' features used by this
    classifier.  For the purpose of this function, the
    informativeness of a feature C{(fname,fval)} is equal to the
    highest value of P(fname=fval|label), for any label, divided by
    the lowest value of P(fname=fval|label), for any label::

      max[ P(fname=fval|label1) / P(fname=fval|label2) ]
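As a rough sketch of how that method could be used for feature selection: train once on all features, keep only the names returned by `most_informative_features`, and retrain on the reduced set. The toy documents, the `bag_of_words` and `restrict` helpers, and `TOP_N = 4` are made up for illustration; a real threshold would be tuned as described above.

```python
import nltk


def bag_of_words(text):
    """Map a document to NLTK-style features: one boolean per distinct word."""
    return {word: True for word in text.lower().split()}


# Tiny made-up training set; real use needs far more data.
labeled_docs = [
    ("great fun film", "pos"),
    ("great acting and great story", "pos"),
    ("fun and moving story", "pos"),
    ("boring film", "neg"),
    ("boring and slow story", "neg"),
    ("slow dull acting", "neg"),
]
train_set = [(bag_of_words(text), label) for text, label in labeled_docs]

# First pass: train on every feature.
classifier = nltk.NaiveBayesClassifier.train(train_set)

# The method returns (feature_name, feature_value) pairs, so two pairs may
# share a name and the resulting set can be smaller than TOP_N.
TOP_N = 4  # illustrative value only
selected = {fname for fname, fval in classifier.most_informative_features(TOP_N)}


def restrict(features):
    """Drop every feature whose name was not selected."""
    return {name: value for name, value in features.items() if name in selected}


# Second pass: retrain on the reduced feature set.
reduced_classifier = nltk.NaiveBayesClassifier.train(
    [(restrict(features), label) for features, label in train_set]
)

print(sorted(selected))
prediction = reduced_classifier.classify(restrict(bag_of_words("great fun")))
print(prediction)
```

With so little data the selected features and the prediction are not meaningful; the point is only the train, select, retrain shape of the procedure.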

In addition to unigram/bag-of-words features, one could try adding significant bigrams to the feature list (the deck has some examples). NLTK provides multiple ways to calculate significance for collocations (including chi-squared).
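A small sketch of the bigram idea using NLTK's collocation finder with the chi-squared measure. The toy corpus, the frequency filter of 2, the cutoff of 5 bigrams, and the `features_with_bigrams` helper are assumptions for illustration and are not taken from the deck.

```python
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy corpus flattened into one token stream. This lets bigrams span
# document boundaries, which is acceptable only for a demonstration.
corpus = [
    "new york is big",
    "new york is busy",
    "i like new york",
    "the city is big",
]
words = [word for doc in corpus for word in doc.split()]

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)  # ignore bigrams seen fewer than 2 times

# Keep at most 5 bigrams ranked by chi-squared association.
significant = set(finder.nbest(BigramAssocMeasures.chi_sq, 5))


def features_with_bigrams(tokens):
    """Unigram features plus one feature per significant bigram present."""
    feats = {word: True for word in tokens}
    for bigram in nltk.bigrams(tokens):
        if bigram in significant:
            feats[" ".join(bigram)] = True
    return feats


example = features_with_bigrams("i like new york".split())
print(sorted(example))
```

The resulting dictionaries can be fed to the same naive Bayes classifier as plain bag-of-words features.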

Another popular approach is to apply tf-idf to all features first (without any feature selection) and use regularization (L1 and/or L2) to deal with irrelevant features (the SVM example from the deck corresponds to L2 regularization). The drawback is that the regularization coefficient has to be tuned on a holdout data set or using cross-validation.
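One way this could look in practice is sketched below with scikit-learn, which is my assumption here (the text above does not name a library). The toy documents, the grid of `C` values, and the 2-fold cross-validation are illustrative only. Note that in scikit-learn a smaller `C` means stronger regularization.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up data set; real use needs far more documents.
docs = [
    "great fun film", "great acting and story", "fun moving story", "great cast",
    "boring film", "boring and slow story", "slow dull acting", "dull cast",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

# tf-idf over all features, followed by a regularized linear SVM.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC(dual=False)),  # dual=False supports both L1 and L2
])

# Tune the penalty type and the regularization strength by cross-validation.
param_grid = {"svm__penalty": ["l1", "l2"], "svm__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

coef = search.best_estimator_.named_steps["svm"].coef_
n_used = int(np.count_nonzero(coef))  # L1 tends to zero out irrelevant features
predicted = search.predict(["great fun story"])

print(search.best_params_)
print(n_used, "of", coef.shape[1], "features have non-zero weight")
```

With an L1 penalty the count of non-zero weights acts as an implicit feature selection; with L2 the weights of irrelevant features are shrunk but usually not exactly zero.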
