Solved – NER at sentence level or document level

conditional-random-field, lstm, named-entity-recognition, natural language, word embeddings

Should NER models (LSTM or CRF) take input training data at sentence level or paragraph level?

Let's say we have this input text, and we would like to do Named Entity Extraction:

GOP Sen. Rand Paul was assaulted in his home in Bowling Green,
Kentucky, on Friday, according to Kentucky State Police. State
troopers responded to a call to the senator's residence at 3:21 p.m.
Friday. Police arrested a man named Rene Albert Boucher, who they
allege "intentionally assaulted" Paul, causing him "minor injury".
Boucher, 59, of Bowling Green was charged with one count of
fourth-degree assault. As of Saturday afternoon, he was being held in
the Warren County Regional Jail on a $5,000 bond.

  1. Paragraph level: we take the whole paragraph as one record and mark each token with its entity label. The model gets ONE record with a LONG sequence.

  2. Sentence level: we first intelligently split the paragraph into the 5 correct sentences and mark each token in each sentence with its entity label (a rough sketch of building such records appears after the example sentences below). The model gets FIVE records with shorter sequences:

0) GOP Sen. Rand Paul was assaulted in his home in Bowling Green,
Kentucky, on Friday, according to Kentucky State Police.

1) State
troopers responded to a call to the senator's residence at 3:21 p.m.
Friday.

2) Police arrested a man named Rene Albert Boucher, who they allege
"intentionally assaulted" Paul, causing him "minor injury".

3) Boucher, 59, of Bowling Green was charged with one count of
fourth-degree assault.

4) As of Saturday afternoon, he was being held in the Warren County
Regional Jail on a $5,000 bond.
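For concreteness, here is a rough sketch of building both kinds of records, assuming NLTK's punkt sentence splitter and word tokenizer (any equivalent tokenizer would do); the entity labels themselves would still have to come from annotation:

```python
# Minimal sketch of options 1 and 2, assuming NLTK is installed.
# Newer NLTK versions may need the "punkt_tab" resource instead of "punkt".
import nltk

nltk.download("punkt", quiet=True)

paragraph = (
    "GOP Sen. Rand Paul was assaulted in his home in Bowling Green, "
    "Kentucky, on Friday, according to Kentucky State Police. State "
    "troopers responded to a call to the senator's residence at 3:21 p.m. "
    "Friday. Police arrested a man named Rene Albert Boucher, who they "
    'allege "intentionally assaulted" Paul, causing him "minor injury". '
    "Boucher, 59, of Bowling Green was charged with one count of "
    "fourth-degree assault. As of Saturday afternoon, he was being held in "
    "the Warren County Regional Jail on a $5,000 bond."
)

# Paragraph level: one record, one long token sequence.
paragraph_record = nltk.word_tokenize(paragraph)

# Sentence level: one record per sentence, shorter token sequences.
sentence_records = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(paragraph)]

print(len(paragraph_record))               # length of the single long sequence
print([len(r) for r in sentence_records])  # lengths of the sentence-level sequences
```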

Which one gives the NER modeling a better NER performance?

I tend to think sentence level is better; however, shouldn't LSTM memory cells learn to remember or forget state automatically, even when given long paragraphs? Especially since sentence splitting itself can make mistakes (see the naive-splitter sketch after the example below), for instance:

1) State troopers responded to a call to the senator's residence at 3:21 p.m. Friday.

could have been

1) State troopers responded to a call to the senator's residence at 3:21 p.m.

2) Friday.
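As an illustration, a naive period-based splitter (rather than a trained sentence tokenizer) produces exactly that bad split; the regex below is just a stand-in for such a splitter:

```python
import re

sentence = ("State troopers responded to a call to the senator's "
            "residence at 3:21 p.m. Friday.")

# Splitting on any whitespace that follows a period cuts right after the
# abbreviation "p.m.", detaching "Friday." as its own "sentence".
naive_split = re.split(r"(?<=\.)\s+", sentence)
print(naive_split)
# -> the sentence truncated at "3:21 p.m." plus a stray "Friday." fragment
```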

Best Answer

From a computational perspective, you would want to use sentences rather than paragraphs when using a Conditional Random Field. The cost of the Viterbi algorithm used for CRF inference grows with the length of each input sequence (roughly linearly in the number of tokens and quadratically in the number of labels), so shorter sentence-level sequences are cheaper to decode than one long paragraph-level sequence.
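To make that cost concrete, here is a bare-bones Viterbi decoder (a generic sketch, not tied to any particular CRF library); the outer loop runs once per token, and each step scores every label-to-label transition, giving roughly O(T · K²) for T tokens and K labels:

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) per-token label scores; transitions: (K, K) label-to-label scores."""
    T, K = emissions.shape
    score = emissions[0].copy()                 # best score ending in each label at token 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):                       # one pass per token: longer sequences cost more
        cand = score[:, None] + transitions     # (K, K): every previous-label -> label move
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    # Follow the back-pointers to recover the best label sequence.
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```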

For an approach with LSTMs, this would depend on design decisions (e.g., character- versus word-based input).
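As one possible word-based design, a minimal BiLSTM tagger sketch in Keras might look like the following; VOCAB_SIZE, NUM_LABELS, and MAX_LEN are assumed placeholder values, and MAX_LEN is where the sentence-versus-paragraph choice shows up, since paragraph-level records force much longer padded sequences:

```python
import tensorflow as tf

# Assumed placeholder hyperparameters, not values from the question.
VOCAB_SIZE, NUM_LABELS, MAX_LEN = 20000, 9, 50

# Word-level BiLSTM tagger: one softmax over entity labels per token.
inputs = tf.keras.Input(shape=(MAX_LEN,))
x = tf.keras.layers.Embedding(VOCAB_SIZE, 100, mask_zero=True)(inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(x)
outputs = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(NUM_LABELS, activation="softmax"))(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Inputs are integer token ids padded/truncated to MAX_LEN: shape (batch, MAX_LEN);
# the label arrays have shape (batch, MAX_LEN) as well.
```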

More interesting is your philosophical/cognitive stance on what text means to us: if it were not for computational reasons, which input would you choose?
