Hidden Markov Model – Training with Multiple Instances

I've implemented a discrete HMM according to this tutorial
http://cs229.stanford.edu/section/cs229-hmm.pdf

This tutorial and others always speak of training an HMM given an observation sequence.

What happens when I have multiple training sequences? Should I just run them sequentially, training the model on one after the other?

Another option is to concatenate the sequences into one and train on that, but then I would get state transitions from the end of one sequence to the start of the next which are not real.

Best Answer

Neither concatenating the sequences nor running each training iteration on a different sequence is the right thing to do. The correct approach requires some explanation:

One usually trains an HMM with an EM algorithm, which consists of several iterations. Each iteration has one "estimate" and one "maximize" step. In the "maximize" step, you align each observation vector x with a state s in your model so that some likelihood measure is maximized. In the "estimate" step, for each state s you estimate (a) the parameters of a statistical model for the x vectors aligned to s and (b) the state transition probabilities. In the following iteration, the maximize step runs again with the updated statistical models, and so on. The process is repeated a fixed number of times, or until the likelihood measure stops rising significantly (i.e., the model converges to a stable solution). Finally, at least in speech recognition, an HMM will typically have a designated "start" state that is aligned to the first observation of the observation sequence, and a "left-to-right" topology, so that once you leave a state you do not return to it.
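
To make the bookkeeping concrete, here is a minimal sketch of the per-sequence pass for a discrete HMM like the one in the linked tutorial, written in Python/NumPy. It computes soft state posteriors with a scaled forward-backward pass (the Baum-Welch flavour of the alignment) rather than the hard alignment described above, but it plays the same role: it produces the statistics that the later parameter re-estimation consumes. The names A (transitions), B (emission probabilities) and pi (initial distribution) are mine, not the tutorial's.

    import numpy as np

    def forward_backward(obs, A, B, pi):
        """Per-sequence pass for a discrete HMM.

        obs : sequence of observed symbol indices, length T
        A   : (N, N) transition matrix, A[i, j] = P(state j | state i)
        B   : (N, M) emission matrix,   B[i, k] = P(symbol k | state i)
        pi  : (N,)   initial state distribution

        Returns the state posteriors gamma (T, N), the pairwise
        posteriors xi (T-1, N, N), and the sequence log-likelihood.
        """
        T, N = len(obs), A.shape[0]
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        scale = np.zeros(T)

        # forward pass, scaled at each step to avoid numerical underflow
        alpha[0] = pi * B[:, obs[0]]
        scale[0] = alpha[0].sum()
        alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            scale[t] = alpha[t].sum()
            alpha[t] /= scale[t]

        # backward pass, reusing the forward scaling factors
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

        # posteriors: how much each observation "belongs" to each state
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)

        xi = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
            xi[t] /= xi[t].sum()

        return gamma, xi, np.log(scale).sum()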

So, if you have multiple training sequences, on the alignment ("maximize") step you should run each sequence so that its initial observation vector aligns with the initial state. That way, the statistics on that initial state are collected from the first observations of all your observation sequences, and in general observation vectors are aligned to the most likely states throughout each sequence. You would only do the parameter-estimation step (and move on to further iterations) after all sequences have been presented for training; a sketch of this accumulate-then-update pattern follows below. On the next iteration, you would do exactly the same thing.
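
Building on the forward_backward routine above, here is an illustrative sketch (not the tutorial's code) of that accumulate-then-update pattern for a discrete HMM: the expected counts gamma and xi are summed over all sequences, and the parameters are re-estimated only once per iteration. In particular, gamma[0] of every sequence contributes to the initial-state statistics, which is exactly the "align the start of each sequence to the initial state" idea.

    def baum_welch_multi(sequences, A, B, pi, n_iters=20):
        """Re-estimate (A, B, pi) from a list of observation sequences.

        Expected counts are accumulated over all sequences in each
        iteration; the parameters are updated only once per iteration.
        """
        N, M = B.shape
        for _ in range(n_iters):
            A_num = np.zeros((N, N))   # expected transition counts
            B_num = np.zeros((N, M))   # expected emission counts
            pi_num = np.zeros(N)       # expected initial-state counts
            occ = np.zeros(N)          # expected occupancy (denominator for A)

            for obs in sequences:
                obs = np.asarray(obs)
                gamma, xi, _ = forward_backward(obs, A, B, pi)
                pi_num += gamma[0]             # every sequence restarts at the initial state
                A_num += xi.sum(axis=0)
                occ += gamma[:-1].sum(axis=0)
                for k in range(M):
                    B_num[:, k] += gamma[obs == k].sum(axis=0)

            # parameter update, done once per iteration after all sequences
            pi = pi_num / pi_num.sum()
            A = A_num / occ[:, None]
            B = B_num / B_num.sum(axis=1, keepdims=True)

        return A, B, pi

Note that if A is initialized with zeros for disallowed transitions (for example a left-to-right topology), those entries stay zero under this update, because the corresponding xi counts are always zero.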

By aligning the start of each observation sequence to the initial state, you avoid the problem with concatenated sequences, where you would incorrectly model transitions from the end of one sequence to the beginning of the next. And by using all the sequences on each iteration, you avoid presenting a different sequence on each iteration, which, as the responder noted, will not guarantee convergence.
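
For completeness, a toy invocation of the sketch above with invented numbers, just to show the calling convention; the left-to-right zero in A0 and the designated start state in pi0 are both preserved by the updates:

    # two states, three symbols, two short training sequences (made-up data)
    A0 = np.array([[0.7, 0.3],
                   [0.0, 1.0]])                  # left-to-right: no return to state 0
    B0 = np.array([[0.6, 0.3, 0.1],
                   [0.1, 0.3, 0.6]])
    pi0 = np.array([1.0, 0.0])                   # designated start state
    seqs = [[0, 0, 1, 2, 2], [0, 1, 1, 2]]
    A1, B1, pi1 = baum_welch_multi(seqs, A0, B0, pi0, n_iters=10)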
