Solved – CRF equivalent in deep learning

conditional-random-field, deep learning, natural language, sequence analysis

Conditional Random Fields (CRFs) are a typical solution for sequence labelling/segmentation problems. For example, the sequence might be a sentence, and a CRF labels each word as being part of a company name, a location, an event, etc.

What is currently the state-of-the-art deep learning equivalent to CRFs for sequence labelling/segmentation?

CRFs have several implementations, including ones in C++ and Java. Does an equivalent implementation exist on the deep learning side?

Best Answer

https://arxiv.org/abs/1606.03475 (De-identification of Patient Notes with Recurrent Neural Networks) uses a neural network with a "label sequence optimization layer" as the top layer for sequence tagging, which can be seen as a "deep learning" equivalent of a CRF.

See Section 2.2.4 Label sequence optimization layer:

The label sequence optimization layer takes the sequence of probability vectors $\mathbf{a}_{1:n}$ from the label prediction layer as input, and outputs a sequence of labels $y_{1:n}$, where $y_{i}$ is the label assigned to the token $x_{i}$.

The simplest strategy to select the label $y_{i}$ would be to choose the label that has the highest probability in $\mathbf{a}_{i}$, i.e. $y_{i}=\text{argmax}_{k}{\mathbf{a}_{i}[k]}$. However, this greedy approach fails to take into account the dependencies between subsequent labels. For example, it may be more likely to have a token with the PHI type STATE followed by a token with the PHI type ZIP than any other PHI type. Even though the label prediction layer has the capacity to capture such dependencies to a certain degree, it may be preferable to allow the model to directly learn these dependencies in the last layer of the model.

One way to model such dependencies is to incorporate a matrix $T$ that contains the transition probabilities between two subsequent labels. $T[i,j]$ is the probability that a token with label $i$ is followed by a token with the label $j$. The score of a label sequence $y_{1:n}$ is defined as the sum of the probabilities of individual labels and the transition probabilities: $$ s(y_{1:n}) = \sum_{i=1}^{n} \mathbf{a}_{i}[y_{i}] + \sum_{i=2}^{n} T[y_{i-1}, y_{i}]. $$ These scores can be turned into probabilities of the label sequences by taking a softmax function over all possible label sequences. During the training phase, the objective is to maximize the log probability of the gold label sequence. In the testing phase, given an input sequence of tokens, the corresponding sequence of predicted labels is chosen as the one that maximizes the score.
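To make the difference between the greedy strategy and the sequence-level decoding concrete, here is a small NumPy sketch. The matrices `a` and `T` follow the paper's notation (per-token label probabilities and label-transition scores); the function names and toy numbers are my own illustration, not the paper's code:

```python
import numpy as np

def sequence_score(a, T, labels):
    """Score s(y_{1:n}) = sum_i a_i[y_i] + sum_i T[y_{i-1}, y_i]."""
    emission = sum(a[i, labels[i]] for i in range(len(labels)))
    transition = sum(T[labels[i - 1], labels[i]] for i in range(1, len(labels)))
    return emission + transition

def greedy_decode(a):
    """Pick the highest-probability label per token, ignoring transitions."""
    return a.argmax(axis=1)

def viterbi_decode(a, T):
    """Find the label sequence maximizing s(y_{1:n}) by dynamic programming."""
    n, k = a.shape
    score = a[0].copy()                      # best score of a path ending in each label
    backpointers = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # candidates[j', j] = score of ending at label j after label j'
        candidates = score[:, None] + T + a[i][None, :]
        backpointers[i] = candidates.argmax(axis=0)
        score = candidates.max(axis=0)
    # Follow backpointers from the best final label.
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(backpointers[i, best[-1]]))
    return np.array(best[::-1])

# Toy example: 3 tokens, 2 labels.
a = np.array([[0.9, 0.1], [0.45, 0.55], [0.6, 0.4]])
T = np.array([[0.9, 0.1], [0.1, 0.9]])
print(greedy_decode(a), viterbi_decode(a, T))
```

With these toy values the greedy labels are `[0, 1, 0]`, while the Viterbi path is `[0, 0, 0]`: the transition matrix makes switching labels expensive, which is exactly the kind of dependency the greedy argmax cannot capture.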

The network:

[Figure from the paper: the network architecture, with the label sequence optimization layer on top of the label prediction layer]

Code: https://github.com/Franck-Dernoncourt/NeuroNER
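NeuroNER itself is implemented in TensorFlow. As a rough illustration of how the same idea can be wired up, here is a PyTorch sketch of an RNN tagger with a learned transition matrix and the sequence-level log-likelihood objective described above. This is not the NeuroNER code; the class name, layer sizes, and toy inputs are all made up for the example:

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, embed_dim=100, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, num_labels)           # label prediction layer
        self.T = nn.Parameter(torch.zeros(num_labels, num_labels))  # label transition scores

    def emissions(self, tokens):                  # tokens: (batch, n) word ids
        h, _ = self.lstm(self.embed(tokens))
        return self.proj(h)                       # (batch, n, num_labels)

    def neg_log_likelihood(self, tokens, labels):
        """-log p(gold label sequence), one (unbatched) sentence for clarity."""
        a = self.emissions(tokens)[0]              # (n, num_labels)
        n = a.size(0)
        # Score of the gold label sequence: emissions plus transitions.
        gold = a[torch.arange(n), labels].sum() + self.T[labels[:-1], labels[1:]].sum()
        # log sum over all label sequences of exp(score), via the forward algorithm.
        alpha = a[0]
        for i in range(1, n):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.T + a[i].unsqueeze(0), dim=0)
        log_partition = torch.logsumexp(alpha, dim=0)
        return log_partition - gold

# Toy usage: one sentence of 4 token ids with 3 possible labels.
model = RNNTagger(vocab_size=1000, num_labels=3)
tokens = torch.tensor([[5, 17, 42, 7]])
labels = torch.tensor([0, 1, 1, 2])
loss = model.neg_log_likelihood(tokens, labels)
loss.backward()
```

At test time, decoding would use the Viterbi procedure sketched earlier rather than the forward algorithm, picking the single label sequence with the highest score.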
