Solved – What are the advantages of combining BiLSTM and CRF

conditional-random-field, lstm, natural language, neural networks, recurrent neural network

BiLSTM-CRF is a common model for sequence tagging (POS tagging, NER, etc.). What are the advantages of combining BiLSTM and CRF? What is the role of each of the parts in this combination?

Best Answer

TL;DR: The BiLSTM knows about the language; the CRF knows the internal logic of the labeling.

With a plain BiLSTM followed by a per-token classifier, each classification decision is conditionally independent of the others given the BiLSTM states. A linear-chain CRF explicitly models dependencies between adjacent labels as a table of transition scores between all pairs of labels.
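To make the "table of transition scores" concrete, here is a minimal sketch of how a linear-chain CRF scores one candidate label sequence: it sums the per-token scores (which would come from the BiLSTM) and the transition scores between consecutive labels. The function name `sequence_score` and all numbers are made up for illustration, and this is only the unnormalized score, not a full CRF with a partition function.

```python
def sequence_score(emissions, transitions, labels):
    """Unnormalized linear-chain CRF score of one label sequence.

    emissions[t][y]   : score of label y at position t (e.g. BiLSTM output)
    transitions[a][b] : score of label b directly following label a
    labels            : list of label indices, one per position
    """
    score = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score


# Toy example: 2 positions, 2 labels (illustrative numbers only).
emissions = [[1.0, 0.5], [0.2, 2.0]]
transitions = [[0.5, -1.0], [0.0, 0.3]]

score_01 = sequence_score(emissions, transitions, [0, 1])
score_11 = sequence_score(emissions, transitions, [1, 1])
```

A plain per-token classifier corresponds to the special case where every transition score is zero, so the best sequence is just the per-position argmax.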

If the labels follow a strict internal syntax, this is extremely easy for the CRF to learn. For NER, there are several ways of encoding the output, but they typically encode at least the Beginning, Inside and Outside of an entity, and these can obviously only occur in a syntactically well-defined order. The CRF will very quickly learn that, for instance, I-LOC cannot follow O; in the BIO scheme it must follow B-LOC or another I-LOC.
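The label syntax described above can be written down as a transition table. The sketch below builds one by hand for a small BIO tag set, assuming the rule that an I-X tag may only follow B-X or I-X. A trained CRF would learn such scores from data rather than have them hard-coded; the helper `allowed`, the tag list, and the penalty value `NEG` are illustrative assumptions.

```python
def allowed(prev, curr):
    """True if tag `curr` may directly follow tag `prev` under BIO."""
    if curr.startswith("I-"):
        entity = curr[2:]
        return prev in ("B-" + entity, "I-" + entity)
    return True  # O and any B-X may follow anything


tags = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER"]
NEG = -1e4  # large penalty standing in for "effectively impossible"

# transitions[prev][curr]: 0.0 for legal moves, NEG for illegal ones.
transitions = {
    prev: {curr: (0.0 if allowed(prev, curr) else NEG) for curr in tags}
    for prev in tags
}
```

With this table, `transitions["O"]["I-LOC"]` carries the penalty while `transitions["B-LOC"]["I-LOC"]` does not, which is the pattern a CRF tends to pick up quickly from BIO-labelled data.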

As another example, a BiLSTM might be unsure whether to place B-PER at one position or one position later, and end up outputting B-PER at both, because the two decisions are conditionally independent. A CRF layer knows that this is unlikely, enforces the internal logic of the tags, and would output B-PER, I-PER instead.
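The B-PER example can be reproduced with a toy Viterbi decoder. In this sketch the emission scores stand in for BiLSTM outputs on a three-token sentence, and the transition scores stand in for what a CRF might have learned; all numbers are invented for illustration, and no start or end transitions are modelled.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence under a linear-chain model."""
    best = {y: emissions[0][y] for y in tags}  # best score of a path ending in y
    backptrs = []
    for em in emissions[1:]:
        new_best, ptr = {}, {}
        for curr in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[p][curr])
            new_best[curr] = best[prev] + transitions[prev][curr] + em[curr]
            ptr[curr] = prev
        best = new_best
        backptrs.append(ptr)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda y: best[y])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]


tags = ["O", "B-PER", "I-PER"]

# Hypothetical per-token scores: the model hesitates at position 1.
emissions = [
    {"O": 1.0, "B-PER": 2.0, "I-PER": 0.0},
    {"O": 0.0, "B-PER": 1.6, "I-PER": 1.5},
    {"O": 3.0, "B-PER": 0.0, "I-PER": 0.0},
]

# Hypothetical transition scores, transitions[prev][curr].
transitions = {
    "O":     {"O": 0.0, "B-PER": 0.0,  "I-PER": -1e4},
    "B-PER": {"O": 0.0, "B-PER": -2.0, "I-PER": 1.0},
    "I-PER": {"O": 0.0, "B-PER": -2.0, "I-PER": 0.5},
}

# Independent per-token decisions vs. joint decoding with transitions.
independent = [max(tags, key=lambda y: em[y]) for em in emissions]
joint = viterbi(emissions, transitions, tags)

print(independent)  # ['B-PER', 'B-PER', 'O']
print(joint)        # ['B-PER', 'I-PER', 'O']
```

The per-token argmax picks B-PER twice because 1.6 narrowly beats 1.5 at the second position; once the transition scores are taken into account, the sequence B-PER, I-PER wins overall.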