Machine Learning – Differences Between Baum-Welch Algorithm and Viterbi Training

baum-welch, hidden markov model, image processing, machine learning, viterbi-algorithm

I am currently using Viterbi training for an image segmentation problem. I wanted to know what the advantages/disadvantages are of using the Baum-Welch algorithm instead of Viterbi training.

Best Answer

The Baum-Welch algorithm and the Viterbi algorithm calculate different things.

If you know the transition probabilities for the hidden part of your model, and the emission probabilities for the visible outputs of your model, then the Viterbi algorithm gives you the most likely complete sequence of hidden states conditional on both your outputs and your model specification.
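For concreteness, here is a minimal numpy sketch of that decoding step for a discrete-emission HMM. The names `pi`, `A`, and `B` (initial distribution, transition matrix, emission matrix) are my own placeholders, not anything from the question.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence for a discrete-emission HMM.

    obs : sequence of observation indices, length T
    pi  : (K,) initial state distribution
    A   : (K, K) transitions, A[i, j] = P(z_t = j | z_{t-1} = i)
    B   : (K, M) emissions,   B[i, m] = P(x_t = m | z_t = i)
    """
    K, T = len(pi), len(obs)
    # Work in log space to avoid underflow on long sequences.
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]   # best log-prob of ending in each state
    psi = np.zeros((T, K), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA     # scores[i, j]: arrive at j from i
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```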

The Baum-Welch algorithm gives you both the most likely hidden transition probabilities and the most likely set of emission probabilities, given only the observed outputs of the model (and, usually, an upper bound on the number of hidden states). You also get the "pointwise" most likely hidden state at each position, and that sequence of pointwise winners is often slightly different from the single hidden sequence that is most likely overall.
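To make that "pointwise" distinction concrete: the Viterbi path solves $\arg\max_{z_{1:T}} P(z_{1:T} \mid x_{1:T})$ jointly over the whole sequence, while the pointwise estimates solve $\arg\max_{z_t} P(z_t \mid x_{1:T})$ separately at each position $t$; the sequence of pointwise winners need not form a coherent path, and can even contain transitions to which the model assigns zero probability.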

If you know your model and just want the latent states, then there is no reason to use the Baum-Welch algorithm. If you don't know your model's parameters, then you can't use the Viterbi algorithm on its own.

Edited to add: See Peter Smit's comment; there's some overlap/vagueness in nomenclature. Some poking around led me to a chapter by Luis Javier Rodríguez and Inés Torres in "Pattern Recognition and Image Analysis" (ISBN 978-3-540-40217-6, pp. 845–857) which discusses the speed versus accuracy trade-offs of the two algorithms.

Briefly, the Baum-Welch algorithm is essentially the Expectation-Maximization (EM) algorithm applied to an HMM; as a strict EM-type algorithm it is guaranteed to converge to at least a local maximum, and so for unimodal problems it finds the MLE. It requires two passes over the data (a forward pass and a backward pass) for each iteration, though, and the computational cost grows steeply with the length of the data and the number of training samples. However, you do end up with the full conditional likelihood for your hidden parameters.
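As a sketch of what one such iteration looks like (the two passes are the forward and backward recursions), here is a minimal discrete-emission version using the same placeholder names as above; a real implementation would iterate until the likelihood converges and would pool statistics across multiple training sequences.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass; returns the E-step posteriors:
    gamma[t, i] = P(z_t = i | x) and xi[t, i, j] = P(z_t = i, z_{t+1} = j | x)."""
    K, T = len(pi), len(obs)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                          # forward pass
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]    # rescale to avoid underflow
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):                 # backward pass
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1]) / c[t+1]
    gamma = alpha * beta
    xi = (alpha[:-1, :, None] * A[None]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / c[1:, None, None]
    return gamma, xi

def baum_welch_step(obs, pi, A, B):
    """One EM iteration: re-estimate pi, A, B from the posteriors (M-step)."""
    gamma, xi = forward_backward(obs, pi, A, B)
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for m in range(B.shape[1]):                    # expected emission counts
        B_new[:, m] = gamma[np.array(obs) == m].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new
```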

The Viterbi training algorithm (as opposed to the "Viterbi algorithm") approximates the MLE to gain speed at the cost of accuracy. It segments the data and then applies the Viterbi algorithm (as I understood it) to get the most likely state sequence within each segment, then uses that state sequence to re-estimate the hidden parameters. Unlike the Baum-Welch algorithm, this does not give the full conditional likelihood of the hidden parameters, and so it ends up reducing accuracy while saving significant computational time (the chapter reports a speedup of one to two orders of magnitude).
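For contrast with the Baum-Welch step above, here is a minimal sketch of one Viterbi-training ("hard EM") step, reusing the `viterbi` function and numpy import from the first sketch: instead of averaging over all state paths with posteriors, it commits to the single best path and re-estimates the parameters by simple counting along it.

```python
def viterbi_training_step(obs, pi, A, B, eps=1e-6):
    """One hard-EM step: decode one best path, then re-estimate the
    parameters from counts along that path (no posterior averaging)."""
    K, M = B.shape
    path = viterbi(obs, pi, A, B)              # hard state assignment
    A_new = np.full((K, K), eps)               # small floor so no row is all-zero
    B_new = np.full((K, M), eps)
    for t in range(1, len(obs)):
        A_new[path[t-1], path[t]] += 1         # count observed transitions
    for t, m in enumerate(obs):
        B_new[path[t], m] += 1                 # count observed emissions
    pi_new = np.full(K, eps); pi_new[path[0]] += 1
    return (pi_new / pi_new.sum(),
            A_new / A_new.sum(axis=1, keepdims=True),
            B_new / B_new.sum(axis=1, keepdims=True))
```

In practice you would accumulate these counts over many training sequences before normalizing; the point of the sketch is just that each step needs only one Viterbi decode plus counting, rather than the full forward-backward pass, which is where the reported speedup comes from.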