Let me begin with a part-of-speech tagging task.
The ultimate goal: given a sentence, what is the most probable part-of-speech tag for each word in the sentence? We want to answer this question by learning an HMM.
Define the state space of the observable variables as $D$ (the dictionary), the collection of all possible words. Define the state space of the hidden variables as $T$ (the tag set).
Then we can learn the transition probabilities between tags and the emission probabilities from tags to words with the EM algorithm.
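As a concrete illustration of that learning step, here is a minimal Baum-Welch (EM for HMMs) sketch in numpy. It assumes words and tags have already been mapped to integer indices, uses a random initialization, and handles a single observation sequence; a real tagger would train on many sentences and use smarter initialization.

```python
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=50, seed=0):
    """Minimal Baum-Welch (EM) for a discrete HMM.
    obs: 1-D array of symbol indices (one observation sequence).
    Returns (pi, A, B): initial, transition, and emission probabilities."""
    rng = np.random.default_rng(seed)
    # Random row-stochastic initialization.
    pi = rng.random(n_states); pi /= pi.sum()
    A = rng.random((n_states, n_states)); A /= A.sum(1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(1, keepdims=True)
    T = len(obs)
    for _ in range(n_iter):
        # E-step: scaled forward pass.
        alpha = np.zeros((T, n_states)); c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        # Scaled backward pass.
        beta = np.zeros((T, n_states)); beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t+1]] * beta[t+1])) / c[t+1]
        # Posterior state and transition probabilities.
        gamma = alpha * beta
        gamma /= gamma.sum(1, keepdims=True)
        xi = np.zeros((n_states, n_states))
        for t in range(T - 1):
            x = alpha[t][:, None] * A * (B[:, obs[t+1]] * beta[t+1])[None, :]
            xi += x / x.sum()
        # M-step: re-estimate parameters from expected counts.
        pi = gamma[0]
        A = xi / xi.sum(1, keepdims=True)
        B = np.zeros((n_states, n_symbols))
        for t in range(T):
            B[:, obs[t]] += gamma[t]
        B /= B.sum(1, keepdims=True)
    return pi, A, B
```

Note that EM only finds a local optimum, and nothing in the procedure attaches linguistic meaning to the resulting states, which is exactly the question raised above.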
Here comes my question: how can we know whether $t_1$ is a verb, a noun, or a preposition? More generally, how can we reason about the meaning of the hidden state variables $t_1, t_2, \ldots, t_{|T|}$?
Best Answer
There are two cases, supervised and unsupervised tagging; I will focus on the unsupervised case.
Regarding the case of unsupervised POS tagging, here is a quick overview of typical evaluation strategies, taken from (3), which I was lucky to TA:
The problem
The hidden states of the HMM are supposed to correspond to parts of speech. However, the model is not given any linguistic information. Furthermore, we do not have the correspondence between numeric hidden states and actual POS tags, such as "noun" or "verb". To evaluate the accuracy of the HMM, we can use a matching algorithm, which attempts to construct a correspondence between hidden states and POS tags that maximizes the POS tagging accuracy. The goal of this matching is to see what POS each hidden state corresponds to.
Let's assume that the HMM has two possible hidden states, $y_1$ and $y_2$, and that there are three possible POS tags, N, V, and ADJ. Consider the two example sentences "Strong stocks are good" and "Big weak stocks are bad" (Figure 1). The HMM has assigned a hidden state ($y_1$ or $y_2$) to each word. In addition, we also know the correct POS tag for each word. Our goal is to use a matching algorithm to match each hidden state to the POS tag that would give us the highest score.
Matching
A matching is a bipartite graph between part-of-speech tags and hidden states. Two types of matchings can be performed: a 1-to-many matching or a 1-to-1 matching. The definitions of 1-to-many and 1-to-1 matchings are given below. Both types of matchings have different advantages and disadvantages when they are used to evaluate the performance of an unsupervised model.
Evaluation
Given a matching, the accuracy of the HMM tagging can be computed against a gold-standard corpus, where each word has been assigned the correct POS tag (based on the Penn WSJ Treebank). The correct tag for a word is compared with the tag that is matched with the word's hidden state. If these two tags match, the tagging is correct for this word.
To find a matching, we can create a table that contains the counts of each correct POS tag for each hidden state. Here is an example evaluation table for the example sentences from Figure 1.
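Building that count table is a simple aggregation. The sketch below uses the two example sentences with their gold tags; the per-word hidden-state assignments are hypothetical (Figure 1 is not reproduced here), chosen to be consistent with the counts quoted in the worked examples below.

```python
from collections import Counter, defaultdict

# Gold POS tags for "Strong stocks are good" and "Big weak stocks are bad".
words  = ["Strong", "stocks", "are", "good", "Big", "weak", "stocks", "are", "bad"]
gold   = ["ADJ",    "N",      "V",   "ADJ",  "ADJ", "ADJ",  "N",      "V",   "ADJ"]

# Hypothetical hidden-state assignments (an assumption for illustration;
# the actual assignments come from the trained HMM).
states = ["y1",     "y1",     "y1",  "y1",   "y1",  "y2",   "y1",     "y2",  "y2"]

# Count how often each (hidden state, gold tag) pair occurs.
table = defaultdict(Counter)
for s, t in zip(states, gold):
    table[s][t] += 1

for s in sorted(table):
    print(s, dict(table[s]))
```

With these assignments, $y_1$ co-occurs with ADJ 3 times, N 2 times, and V once, while $y_2$ co-occurs with ADJ twice and V once.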
The matching between hidden states and POS tags can then be found based on this table for both 1-to-many and 1-to-1 matchings.
Finding a good matching
As we can see, the accuracy of the HMM depends on the type of matching that is chosen (1-to-many or 1-to-1). To find a good matching, we can use a greedy algorithm that chooses individual state-to-tag assignments one at a time, at each step picking the new assignment that maximally increases the accuracy. The algorithm terminates when no more assignments can be made. In a 1-to-1 matching, the algorithm terminates when either each POS tag has been assigned a hidden state or each hidden state has received a POS tag (whichever happens first). In a 1-to-many matching, the algorithm terminates when each hidden state has received a POS tag.
Example: 1-to-many Matching: Consider the table from Figure 4. The highest count we see in the table is 3, which corresponds to $y_1$ being assigned the ADJ tag. Thus, we greedily assign state $y_1$ to ADJ. Since the highest count for $y_2$ is also ADJ, state $y_2$ is also assigned to the ADJ tag. $y_1$ tags 3 words correctly, $y_2$ tags 2 words correctly, and there are 9 words in total, so the best score from a 1-to-many matching is $\frac{5}{9}$. The bipartite matching is illustrated in Figure 5.
Example: 1-to-1 Matching: We know that the highest count for $y_1$ is 3, which corresponds to an adjective, so we again greedily assign state $y_1$ to ADJ. Even though state $y_2$ also has the highest count for adjectives, ADJ has already been matched to state $y_1$ so we cannot use it again. Thus, $y_2$ must be assigned to V, which has the highest count that is not yet assigned to any state. $y_1$ tags 3 words correctly and $y_2$ tags 1 word correctly, so the total score for a 1-to-1 matching is $\frac{4}{9}$. See Figure 8 for the matching.
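Both matching procedures can be sketched with one greedy routine: sort all (state, tag) count pairs in descending order and accept a pair unless its state is already matched (or, in the 1-to-1 case, its tag is already taken). The count table below is an assumption (Figure 4 is not reproduced here), chosen to be consistent with the worked examples above.

```python
def greedy_matching(counts, one_to_one):
    """Greedily assign each hidden state a POS tag by descending count.
    counts: {state: {tag: count}}. Returns (matching dict, accuracy)."""
    pairs = sorted(((c, s, t) for s, tags in counts.items()
                    for t, c in tags.items()), reverse=True)
    total = sum(c for tags in counts.values() for c in tags.values())
    matched, used_tags, correct = {}, set(), 0
    for c, s, t in pairs:
        if s in matched:            # each state gets exactly one tag
            continue
        if one_to_one and t in used_tags:  # each tag used at most once
            continue
        matched[s] = t
        used_tags.add(t)
        correct += c
    return matched, correct / total

# Hypothetical count table consistent with the examples in the text:
# y1 co-occurs with ADJ 3x, N 2x, V 1x; y2 with ADJ 2x, V 1x (9 words).
counts = {"y1": {"ADJ": 3, "N": 2, "V": 1}, "y2": {"ADJ": 2, "V": 1}}

print(greedy_matching(counts, one_to_one=False))  # 1-to-many matching
print(greedy_matching(counts, one_to_one=True))   # 1-to-1 matching
```

On this table the 1-to-many matching assigns both states to ADJ for an accuracy of $\frac{5}{9}$, while the 1-to-1 matching assigns $y_1$ to ADJ and $y_2$ to V for an accuracy of $\frac{4}{9}$, reproducing the scores above. (For an optimal rather than greedy 1-to-1 matching, one could solve the assignment problem with the Hungarian algorithm instead.)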
Pitfalls of the HMM
Fully unsupervised tagging models generally perform very poorly. The HMM computed with the EM algorithm tends to give a relatively uniform distribution of hidden states, while the empirical distribution for POS tags is highly skewed since some tags are much more common than others. As a result, any 1-to-1 matching will give a POS tag distribution that is relatively uniform, which results in a low accuracy score.
A 1-to-many matching can create a non-uniform distribution by matching a single POS tag with many hidden states. However, as the number of hidden states increases, the model could overfit by giving each word in the vocabulary its own hidden state. If each word has its own hidden state, we would achieve 100 percent accuracy, but the model would not give us any useful information. For more details on the evaluation of HMM for POS tagging, see Why doesn't EM find good HMM POS-taggers? by Mark Johnson.
Improving Unsupervised Models
There are several modifications we can make to improve the accuracy of unsupervised models.