Solved – the difference between Support Vector Machines and Conditional Random Fields models in the context of Named Entity Recognition

natural language, svm

Can someone intuitively explain how Support Vector Machine (SVM) and Conditional Random Field (CRF) models can perform Named Entity Recognition (NER), and the conceptual difference between them? Also, is there a preference for one approach over the other depending on what we would like to achieve?

Best Answer

Assume that for our data, we have observations (each observation is a set of features) and labels (one label per observation). In the case of NER, we have a set of features about a word (the observation), and the labels commonly follow a BIO scheme, where we might have B-Loc for the beginning of a location, I-Loc for inside a location, B-Pers for the beginning of a person name, I-Pers for inside a person name, and O for outside any named entity. The sentence "The man was George W. Bush." would be labeled "The (O) man (O) was (O) George (B-Pers) W. (I-Pers) Bush (I-Pers)."
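To make the data shape concrete, here is a minimal sketch in plain Python (no particular toolkit is implied by the answer; this just shows the token/label pairing):

```python
# A sentence is a sequence of (token, label) pairs under the BIO scheme.
sentence = [
    ("The", "O"),
    ("man", "O"),
    ("was", "O"),
    ("George", "B-Pers"),
    ("W.", "I-Pers"),
    ("Bush", "I-Pers"),
]

tokens = [tok for tok, _ in sentence]
labels = [lab for _, lab in sentence]
print(tokens)  # ['The', 'man', 'was', 'George', 'W.', 'Bush']
print(labels)  # ['O', 'O', 'O', 'B-Pers', 'I-Pers', 'I-Pers']
```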

An SVM predicts a label given the set of features, and none of these features can refer to the labels of the preceding observations. However, the features used to train an SVM can refer to the features of the preceding and following observations. For instance, we cannot have features such as preceding-B-Loc or preceding-I-Loc, but we can have preceding-NOUN, preceding-VERB, following-NOUN, following-VERB, etc., where each of these features is given a binary 0/1 value in training. In practice, an SVM implementation may ask you to number your features from 1 to |num_features| and, for each observation, list only the feature indices whose value is 1. At test time, we build the same features by looking at the observation, such as the part of speech of the word and of the preceding and following words, and the SVM uses these features to predict the label. One would typically create many features, so the observations are sparse.
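Here is a minimal sketch of that per-token setup, assuming scikit-learn (the answer does not name a library, and the feature names below are purely illustrative):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# One feature dict per token: properties of the word and of the
# preceding/following observations (never their labels).
train_features = [
    {"word=The": 1, "pos=DET": 1, "following-NOUN": 1},
    {"word=man": 1, "pos=NOUN": 1, "preceding-DET": 1, "following-VERB": 1},
    {"word=was": 1, "pos=VERB": 1, "preceding-NOUN": 1, "following-PROPN": 1},
    {"word=George": 1, "pos=PROPN": 1, "preceding-VERB": 1, "following-PROPN": 1},
    {"word=W.": 1, "pos=PROPN": 1, "preceding-PROPN": 1, "following-PROPN": 1},
    {"word=Bush": 1, "pos=PROPN": 1, "preceding-PROPN": 1},
]
train_labels = ["O", "O", "O", "B-Pers", "I-Pers", "I-Pers"]

# DictVectorizer numbers the feature names and stores only the features
# with value 1, i.e. the sparse 0/1 representation described above.
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_features)

clf = LinearSVC()
clf.fit(X_train, train_labels)

# At test time, build the same kind of feature dict for each new token.
X_test = vectorizer.transform([{"word=Bush": 1, "pos=PROPN": 1, "preceding-PROPN": 1}])
print(clf.predict(X_test))
```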

In short, the CRF uses the probability of a label given a set of features plus the other labels in the sequence (which is why training is slow). This is a simplification: the CRF model iteratively propagates label probabilities forwards and backwards (see the forward-backward algorithm) until training converges on a local maximum. A CRF uses a graph (vertices and edges) over the observations of the input: if you picture the tokens from left to right, each token is associated with a vertical column underneath it, and along this column are vertices for all possible labels (B-Loc, I-Loc, B-Pers, I-Pers, O, etc.), so there is one vertex per label for each token. There are no edges within a column; each vertex is connected to every vertex in the adjacent columns (forward and back). After training has run its course, the edge weights (together with the per-vertex scores) are used to predict the best label for each token given the observation sequence.
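To illustrate how that lattice is decoded, here is a small Viterbi sketch in Python/NumPy. The per-vertex (emission) and per-edge (transition) scores below are invented for illustration; in a real CRF these would be the weights learned during training:

```python
import numpy as np

labels = ["B-Loc", "I-Loc", "B-Pers", "I-Pers", "O"]

# emission[t, j]: how well token t's observed features match label j (made-up scores).
emission = np.array([
    [0.1, 0.1, 0.1, 0.1, 0.9],   # The
    [0.1, 0.1, 0.1, 0.1, 0.9],   # man
    [0.1, 0.1, 0.1, 0.1, 0.9],   # was
    [0.1, 0.1, 0.8, 0.1, 0.2],   # George
    [0.1, 0.1, 0.2, 0.8, 0.1],   # W.
    [0.1, 0.1, 0.2, 0.8, 0.1],   # Bush
])

# transition[i, j]: edge weight for moving from label i on one token to label j on the next.
transition = np.full((5, 5), 0.1)
transition[labels.index("B-Pers"), labels.index("I-Pers")] = 0.9
transition[labels.index("I-Pers"), labels.index("I-Pers")] = 0.9
transition[labels.index("O"), labels.index("O")] = 0.5

def viterbi(emission, transition):
    n_tokens, n_labels = emission.shape
    score = np.zeros((n_tokens, n_labels))
    backptr = np.zeros((n_tokens, n_labels), dtype=int)
    score[0] = emission[0]
    for t in range(1, n_tokens):
        # For each label, keep the best-scoring edge coming in from the previous column.
        cand = score[t - 1][:, None] + transition + emission[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # Follow the back-pointers to recover the best label sequence.
    best = [int(score[-1].argmax())]
    for t in range(n_tokens - 1, 0, -1):
        best.append(int(backptr[t][best[-1]]))
    return [labels[i] for i in reversed(best)]

print(viterbi(emission, transition))
# ['O', 'O', 'O', 'B-Pers', 'I-Pers', 'I-Pers']
```

The point of the sketch is that prediction combines the per-token scores (the vertices in each column) with the transition scores (the edges between adjacent columns), whereas the SVM above scores each token independently.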

The preference between these choices: it depends. Off-the-shelf NER tools like the Stanford NER (part of the Stanford NLP toolkit) give you a pre-trained CRF. If you have a specialized domain where you need to train your own model, a CRF is probably overkill and an SVM is sufficient.