Let me begin with a part-of-speech tagging task.
The ultimate goal: given a sentence, what is the most probable part-of-speech tag for each word in the sentence? We want to answer this question by learning an HMM.
Define the state space of the observable variables as $D$ (the dictionary), the collection of all possible words. Define the state space of the hidden variables as $T$ (the tag set).
Then we can learn the transition probabilities between tags and the emission probabilities from tags to words with the EM algorithm.
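As a concrete illustration of that learning step, here is a minimal Baum-Welch (EM for HMMs) sketch in numpy. It assumes words and tags have already been mapped to integer indices, uses a random initialization, and handles a single observation sequence; a real tagger would train on many sentences and use smarter initialization.

```python
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=50, seed=0):
    """Minimal Baum-Welch (EM) for a discrete HMM.
    obs: 1-D array of symbol indices (one observation sequence).
    Returns (pi, A, B): initial, transition, and emission probabilities."""
    rng = np.random.default_rng(seed)
    # Random row-stochastic initialization.
    pi = rng.random(n_states); pi /= pi.sum()
    A = rng.random((n_states, n_states)); A /= A.sum(1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(1, keepdims=True)
    T = len(obs)
    for _ in range(n_iter):
        # E-step: scaled forward pass.
        alpha = np.zeros((T, n_states)); c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        # Scaled backward pass.
        beta = np.zeros((T, n_states)); beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t+1]] * beta[t+1])) / c[t+1]
        # Posterior state and transition probabilities.
        gamma = alpha * beta
        gamma /= gamma.sum(1, keepdims=True)
        xi = np.zeros((n_states, n_states))
        for t in range(T - 1):
            x = alpha[t][:, None] * A * (B[:, obs[t+1]] * beta[t+1])[None, :]
            xi += x / x.sum()
        # M-step: re-estimate parameters from expected counts.
        pi = gamma[0]
        A = xi / xi.sum(1, keepdims=True)
        B = np.zeros((n_states, n_symbols))
        for t in range(T):
            B[:, obs[t]] += gamma[t]
        B /= B.sum(1, keepdims=True)
    return pi, A, B
```

Note that EM only finds a local optimum, and nothing in the procedure attaches linguistic meaning to the resulting states, which is exactly the question raised above.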
Here comes my question: how can we know whether $t_1$ is a verb, a noun, or a preposition? More generally, how can we reason about the meaning of the hidden state variables $t_1, t_2, \ldots, t_{|T|}$?
Best Answer
There are two cases, supervised and unsupervised tagging; I will focus on the unsupervised case.
Regarding the case of unsupervised POS tagging, here is a quick overview of typical evaluation strategies, taken from (3), which I was lucky to TA:
The problem
The hidden states of the HMM are supposed to correspond to parts of speech. However, the model is not given any linguistic information. Furthermore, we do not have the correspondence between numeric hidden states and actual POS tags, such as "noun" or "verb". To evaluate the accuracy of the HMM, we can use a matching algorithm, which attempts to construct a correspondence between hidden states and POS tags that maximizes the POS tagging accuracy. The goal of this matching is to see what POS each hidden state corresponds to.
Let's assume that the HMM has two possible hidden states, $y_1$ and $y_2$, and that there are three possible POS tags, N, V, and ADJ. Consider the two example sentences "Strong stocks are good" and "Big weak stocks are bad" (Figure 1). The HMM has assigned a hidden state ($y_1$ or $y_2$) to each word. In addition, we also know the correct POS tag for each word. Our goal is to use a matching algorithm to match each hidden state to the POS tag that would give us the highest score.
Matching
A matching is a bipartite graph between part-of-speech tags and hidden states. Two types of matchings can be performed: a 1-to-many matching or a 1-to-1 matching. The definitions of 1-to-many and 1-to-1 matchings are given below. Both types of matchings have different advantages and disadvantages when they are used to evaluate the performance of an unsupervised model.
Evaluation
Given a matching, the accuracy of the HMM tagging can be computed against a gold-standard corpus, where each word has been assigned the correct POS tag (based on the Penn WSJ Treebank). The correct tag for a word is compared with the tag that is matched with the word's hidden state. If these two tags match, the tagging is correct for this word.
To find a matching, we can create a table that contains the counts of each correct POS tag for each hidden state. Here is an example evaluation table for the example sentences from Figure 1.
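Building that count table is a simple aggregation. The sketch below uses the two example sentences with their gold tags; the per-word hidden-state assignments are hypothetical (Figure 1 is not reproduced here), chosen to be consistent with the counts quoted in the worked examples below.

```python
from collections import Counter, defaultdict

# Gold POS tags for "Strong stocks are good" and "Big weak stocks are bad".
words  = ["Strong", "stocks", "are", "good", "Big", "weak", "stocks", "are", "bad"]
gold   = ["ADJ",    "N",      "V",   "ADJ",  "ADJ", "ADJ",  "N",      "V",   "ADJ"]

# Hypothetical hidden-state assignments (an assumption for illustration;
# the actual assignments come from the trained HMM).
states = ["y1",     "y1",     "y1",  "y1",   "y1",  "y2",   "y1",     "y2",  "y2"]

# Count how often each (hidden state, gold tag) pair occurs.
table = defaultdict(Counter)
for s, t in zip(states, gold):
    table[s][t] += 1

for s in sorted(table):
    print(s, dict(table[s]))
```

With these assignments, $y_1$ co-occurs with ADJ 3 times, N 2 times, and V once, while $y_2$ co-occurs with ADJ twice and V once.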
The matching between hidden states and POS tags can then be found based on this table for both 1-to-many and 1-to-1 matchings.
Finding a good matching
As we can see, the accuracy of the HMM depends on the type of matching that is chosen (1-to-many or 1-to-1). To find a good matching, we can use a greedy algorithm that chooses individual state-to-tag assignments one at a time, at each step picking the new assignment that maximally increases the accuracy. The algorithm terminates when no more assignments can be made. In a 1-to-1 matching, the algorithm terminates when either each POS tag has been assigned a hidden state or each hidden state has received a POS tag (whichever happens first). In a 1-to-many matching, the algorithm terminates when each hidden state has received a POS tag.
Example: 1-to-many Matching: Consider the table from Figure 4. The highest count we see in the table is 3, which corresponds to $y_1$ being assigned the ADJ tag. Thus, we greedily assign state $y_1$ to ADJ. Since the highest count for $y_2$ is also ADJ, state $y_2$ is also assigned to the ADJ tag. $y_1$ tags 3 words correctly, $y_2$ tags 2 words correctly, and there are 9 words in total, so the best score from a 1-to-many matching is $\frac{5}{9}$. The bipartite matching is illustrated in Figure 5.
Example: 1-to-1 Matching: We know that the highest count for $y_1$ is 3, which corresponds to an adjective, so we again greedily assign state $y_1$ to ADJ. Even though state $y_2$ also has the highest count for adjectives, ADJ has already been matched to state $y_1$ so we cannot use it again. Thus, $y_2$ must be assigned to V, which has the highest count that is not yet assigned to any state. $y_1$ tags 3 words correctly and $y_2$ tags 1 word correctly, so the total score for a 1-to-1 matching is $\frac{4}{9}$. See Figure 8 for the matching.
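Both matching procedures can be sketched with one greedy routine: sort all (state, tag) count pairs in descending order and accept a pair unless its state is already matched (or, in the 1-to-1 case, its tag is already taken). The count table below is an assumption (Figure 4 is not reproduced here), chosen to be consistent with the worked examples above.

```python
def greedy_matching(counts, one_to_one):
    """Greedily assign each hidden state a POS tag by descending count.
    counts: {state: {tag: count}}. Returns (matching dict, accuracy)."""
    pairs = sorted(((c, s, t) for s, tags in counts.items()
                    for t, c in tags.items()), reverse=True)
    total = sum(c for tags in counts.values() for c in tags.values())
    matched, used_tags, correct = {}, set(), 0
    for c, s, t in pairs:
        if s in matched:            # each state gets exactly one tag
            continue
        if one_to_one and t in used_tags:  # each tag used at most once
            continue
        matched[s] = t
        used_tags.add(t)
        correct += c
    return matched, correct / total

# Hypothetical count table consistent with the examples in the text:
# y1 co-occurs with ADJ 3x, N 2x, V 1x; y2 with ADJ 2x, V 1x (9 words).
counts = {"y1": {"ADJ": 3, "N": 2, "V": 1}, "y2": {"ADJ": 2, "V": 1}}

print(greedy_matching(counts, one_to_one=False))  # 1-to-many matching
print(greedy_matching(counts, one_to_one=True))   # 1-to-1 matching
```

On this table the 1-to-many matching assigns both states to ADJ for an accuracy of $\frac{5}{9}$, while the 1-to-1 matching assigns $y_1$ to ADJ and $y_2$ to V for an accuracy of $\frac{4}{9}$, reproducing the scores above. (For an optimal rather than greedy 1-to-1 matching, one could solve the assignment problem with the Hungarian algorithm instead.)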
Pitfalls of the HMM
Fully unsupervised tagging models generally perform very poorly. The HMM computed with the EM algorithm tends to give a relatively uniform distribution of hidden states, while the empirical distribution for POS tags is highly skewed since some tags are much more common than others. As a result, any 1-to-1 matching will give a POS tag distribution that is relatively uniform, which results in a low accuracy score.
A 1-to-many matching can create a non-uniform distribution by matching a single POS tag with many hidden states. However, as the number of hidden states increases, the model could overfit by giving each word in the vocabulary its own hidden state. If each word has its own hidden state, we would achieve 100 percent accuracy, but the model would not give us any useful information. For more details on the evaluation of HMM for POS tagging, see Why doesn't EM find good HMM POS-taggers? by Mark Johnson.
Improving Unsupervised Models
There are several modifications we can make to improve the accuracy of unsupervised models.