BiLSTM-CRF is a common model for sequence tagging (POS tagging, NER, etc.). What are the advantages of combining BiLSTM and CRF? What is the role of each part in this combination?
Solved – What are the advantages of combining BiLSTM and CRF
conditional-random-field, lstm, natural language, neural networks, recurrent neural network
Best Answer
TL;DR: BiLSTM knows about the language, CRF knows the internal logic of the labeling.
With a plain BiLSTM followed by a classifier, each classification decision is conditionally independent of the others given the BiLSTM outputs. A linear-chain CRF explicitly models dependencies between the labels as a table of transition scores between all pairs of labels.
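
To make the transition table concrete, here is a minimal sketch of how a linear-chain CRF scores a whole label sequence: per-token emission scores (standing in for BiLSTM outputs) plus a transition score for each adjacent pair of labels. The label set and every number below are made up for illustration, and start/end transitions and normalization are left out.

```python
# Hypothetical label set and hand-picked scores, for illustration only.
LABELS = ["O", "B-LOC", "I-LOC"]

# TRANSITIONS[i][j]: score of moving from LABELS[i] to LABELS[j].
# A trained CRF learns these values; here O -> I-LOC is heavily penalized.
TRANSITIONS = [
    # to:  O    B-LOC  I-LOC
    [0.5, 0.0, -10.0],  # from O
    [0.0, -1.0, 1.0],   # from B-LOC
    [0.0, -1.0, 0.5],   # from I-LOC
]

# EMISSIONS[t][j]: score of LABELS[j] at token t (assumed BiLSTM outputs).
# At the second token the classifier slightly prefers I-LOC over B-LOC.
EMISSIONS = [
    [2.0, 0.1, 0.0],
    [0.0, 1.0, 1.2],
]


def sequence_score(emissions, labels):
    """Unnormalized linear-chain CRF score of one label sequence."""
    idx = [LABELS.index(label) for label in labels]
    # Emission part: how well each label fits its own token.
    score = sum(emissions[t][y] for t, y in enumerate(idx))
    # Transition part: how plausible each adjacent label pair is.
    score += sum(TRANSITIONS[a][b] for a, b in zip(idx, idx[1:]))
    return score


invalid = sequence_score(EMISSIONS, ["O", "I-LOC"])
valid = sequence_score(EMISSIONS, ["O", "B-LOC"])
print(invalid, valid)
```

Per-token argmax would pick `O, I-LOC` here, but once the transition scores are added the sequence `O, B-LOC` scores higher, because the `O` to `I-LOC` transition carries a large penalty.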
If the labels follow a strict internal syntax, the CRF learns it very easily. For NER there are several ways of encoding the output, but they typically mark at least the Beginning, Inside and Outside of an entity, and these tags can only occur in a syntactically well-defined order. The CRF will very quickly pick up that, for instance, `I-LOC` cannot follow `O`; it must follow `B-LOC` or another `I-LOC`.
Another example: a BiLSTM might be unsure whether to place `B-PER` on one position or one position later, and end up outputting `B-PER` on both, because the decisions are conditionally independent. A CRF layer knows that this is unlikely, enforces the internal logic of the tags, and would output `B-PER, I-PER`.
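
The `B-PER` case can be sketched with a small Viterbi decoder, which is the usual way to find the best-scoring sequence under a linear-chain CRF. Everything here is an assumption for illustration: the three example tokens, the emission scores (standing in for BiLSTM outputs), and the transition scores (which a real CRF would learn). Start and end transitions are omitted.

```python
# Hypothetical label set and hand-picked scores, for illustration only.
LABELS = ["O", "B-PER", "I-PER"]

# TRANSITIONS[i][j]: score of moving from LABELS[i] to LABELS[j].
TRANSITIONS = [
    # to:  O    B-PER  I-PER
    [0.5, 0.0, -10.0],  # from O
    [0.0, -2.0, 1.0],   # from B-PER
    [0.0, -2.0, 0.5],   # from I-PER
]

# EMISSIONS[t][j]: score of LABELS[j] at token t (assumed BiLSTM outputs).
EMISSIONS = [
    [2.0, 0.0, 0.0],  # "met"
    [0.0, 2.0, 0.5],  # "John"
    [0.0, 1.1, 1.0],  # "Smith": B-PER narrowly beats I-PER
]


def argmax_decode(emissions):
    """Pick the best label for each token independently."""
    n = len(LABELS)
    return [LABELS[max(range(n), key=lambda j: row[j])] for row in emissions]


def viterbi_decode(emissions, transitions):
    """Return the label sequence with the highest emission + transition score."""
    n = len(LABELS)
    best = list(emissions[0])  # best[j]: best score of any path ending in label j
    backptrs = []
    for row in emissions[1:]:
        ptrs, new_best = [], []
        for j in range(n):
            # Best previous label for arriving at label j.
            i = max(range(n), key=lambda k: best[k] + transitions[k][j])
            ptrs.append(i)
            new_best.append(best[i] + transitions[i][j] + row[j])
        backptrs.append(ptrs)
        best = new_best
    # Follow the back-pointers from the best final label.
    path = [max(range(n), key=lambda j: best[j])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return [LABELS[j] for j in path]


print(argmax_decode(EMISSIONS))                # ['O', 'B-PER', 'B-PER']
print(viterbi_decode(EMISSIONS, TRANSITIONS))  # ['O', 'B-PER', 'I-PER']
```

With these numbers the independent decoder emits two `B-PER` tags in a row, while the decoder that also uses transition scores resolves the uncertainty into `B-PER, I-PER`.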