I am currently using Viterbi training for an image segmentation problem. I wanted to know what the advantages/disadvantages are of using the Baum-Welch algorithm instead of Viterbi training.
Machine Learning – Differences Between Baum-Welch Algorithm and Viterbi Training
baum-welch, hidden markov model, image processing, machine learning, viterbi-algorithm
Related Solutions
Referring to one of the answers (by Masterfool) from the question link you provided:
Morat's answer is false on one point: Baum-Welch is an Expectation-Maximization algorithm, used to train an HMM's parameters. It uses the forward-backward algorithm during each iteration. The forward-backward algorithm really is just a combination of the forward and backward algorithms: one forward pass, one backward pass.
And I agree with PierreE's answer here: the Baum–Welch algorithm is used to solve the maximum-likelihood problem for an HMM. If the states are known (a supervised, labeled sequence), then another method of maximizing the likelihood is used, for example simply counting the frequency of each emission and transition observed in the training data (see the slides provided by Franck Dernoncourt, and the counting sketch at the end of this answer).
In the setting of MLE for an HMM, I don't think you can just use gradient descent, since the likelihood (or log-likelihood) has no closed-form maximizer and must be optimized iteratively, just as in the case of mixture models, so we turn to EM. (See Bishop, Pattern Recognition and Machine Learning, Section 13.2.1, p. 614, for more details.)
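To make the supervised (labeled-states) case concrete, here is a minimal sketch of the counting estimator; the function name and the integer coding of states and symbols are my own illustrative choices, not from the answer or from any particular library:

```python
import numpy as np

def supervised_hmm_mle(states, observations, n_states, n_symbols):
    """MLE by counting when the state sequence is known (supervised case).

    `states` and `observations` are equal-length sequences of integer codes.
    Assumes every state actually occurs; otherwise add smoothing pseudo-counts.
    """
    trans = np.zeros((n_states, n_states))
    emit = np.zeros((n_states, n_symbols))
    for t in range(len(states) - 1):
        trans[states[t], states[t + 1]] += 1       # count observed transitions
    for s, x in zip(states, observations):
        emit[s, x] += 1                            # count observed emissions
    # Normalizing the counts row-wise gives the maximum-likelihood estimates.
    trans /= trans.sum(axis=1, keepdims=True)
    emit /= emit.sum(axis=1, keepdims=True)
    return trans, emit
```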
Background and Notation
Let $x_1, \ldots, x_T$ be the observations, and $z_1, \ldots, z_T$ be the hidden states. The forward-backward recursions are written in terms of $\alpha(z_n)$ and $\beta(z_n)$.
$\alpha(z_n) = p(x_1,\ldots,x_n,z_n)$ is the joint distribution of all the data observed so far and the most recent state. As $n$ gets large, this gets very small, simply because of basic properties of probability: the probability of a joint event can only shrink as you intersect it with more events.
$\beta(z_n) = p(x_{n+1},\ldots,x_T|z_n)$ is the probability of all the future data given the current state. This gets small as $n$ goes down, for the same reasons as above.
Forward Algorithm
You can see that both of these things become very small, and hence cause numerical issues, whenever you have a non-tiny dataset. So instead of using the forward recursion $$ \alpha(z_n) = p(x_n|z_n) \sum_{z_{n-1}}\alpha(z_{n-1})p(z_n|z_{n-1}) $$ use the filtering recursion: $$ p(z_n|x_1,\ldots,x_n) = \frac{p(x_n|z_n)\sum_{z_{n-1}} p(z_n|z_{n-1})p(z_{n-1}|x_1,\ldots,x_{n-1})}{p(x_n|x_1,\ldots,x_{n-1})}. $$ You can get this algebraically by dividing both sides of the first recursion by $p(x_1,\ldots,x_n)$. But I would program it by keeping track of the filtering distributions, not the $\alpha$ quantities. You usually don't need to worry about computing the normalizing constant separately: as the last step you can just normalize your probability vector, provided your state space isn't too large.
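To make the filtering recursion concrete, here is a minimal NumPy sketch; the function name and the matrix conventions (rows of A index the previous state, B holds discrete emission probabilities) are assumptions of mine rather than anything standard:

```python
import numpy as np

def forward_filter(obs, pi, A, B):
    """Filtering recursion p(z_n | x_1, ..., x_n), i.e. the "scaled" forward pass.

    pi : initial state distribution, shape (K,)
    A  : transition matrix, A[i, j] = p(z_n = j | z_{n-1} = i), shape (K, K)
    B  : emission matrix, B[j, x] = p(x | z = j), shape (K, M)
    Returns the filtering distributions and the log-likelihood log p(x_1, ..., x_T).
    """
    T, K = len(obs), len(pi)
    filt = np.zeros((T, K))
    loglik = 0.0
    pred = pi                                   # p(z_1) to start with
    for n in range(T):
        unnorm = B[:, obs[n]] * pred            # p(x_n | z_n) * p(z_n | x_1..x_{n-1})
        norm = unnorm.sum()                     # p(x_n | x_1..x_{n-1}), the denominator
        filt[n] = unnorm / norm
        loglik += np.log(norm)                  # the log-likelihood comes out for free
        pred = filt[n] @ A                      # predict the next state: p(z_{n+1} | x_1..x_n)
    return filt, loglik
```

Note how the denominator $p(x_n|x_1,\ldots,x_{n-1})$ is just the sum of the unnormalized vector, so normalizing at each step also hands you the log-likelihood.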
I am more familiar with the non-HMM state space model literature, and the filtering recursions are probably the most common set of recursions. I have found a few results on the internet where HMM people call this procedure "scaling." So it's nice to find this connection and figure out what all this $\alpha$ stuff is.
Backward Algorithm
Next, the traditional HMM backward algorithm is stated as $$ \beta(z_n) = \sum_{z_{n+1}}\beta(z_{n+1})p(x_{n+1}|z_{n+1})p(z_{n+1}|z_n). $$ This one is trickier, though. I have found a few websites that say to divide both sides by $p(x_1,\ldots,x_n)$ again, and I have also found resources that say to divide both sides by $p(x_{n+1},\ldots,x_T|x_1,\ldots,x_n)$. I'm still working this out, though. Also, I suspect there's another connection between this and the backward recursions in other areas of the state space model literature (for marginal and joint smoothing distributions). I'll keep you posted.
Edit:
It turns out if you multiply both sides of the HMM backward equation above by $p(x_1, \ldots, x_n,z_n)/p(x_1,\ldots,x_T)$ you get the backward equation for the marginal smoothing distribution, which are more common in non-HMM state space models.
$$ p(z_n|x_1,\ldots,x_T) = p(z_n|x_1,\ldots,x_n) \sum_{z_{n+1}}\frac{p(z_{n+1}|z_n)}{p(z_{n+1}|x_1,\ldots,x_n)}p(z_{n+1}|x_1,\ldots,x_T). $$
I don't know if this is standard for using Baum-Welch, but it's pretty standard with other state space models. Cool.
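For completeness, here is a sketch of that marginal-smoothing backward pass, written against the `forward_filter` sketch above; again the names and conventions are mine:

```python
import numpy as np

def marginal_smoother(filt, A):
    """Backward pass for p(z_n | x_1, ..., x_T), using the recursion above.

    filt : filtering distributions from forward_filter, shape (T, K)
    A    : transition matrix, A[i, j] = p(z_{n+1} = j | z_n = i)
    Assumes every one-step-ahead predictive probability is strictly positive.
    """
    T, K = filt.shape
    smooth = np.zeros((T, K))
    smooth[-1] = filt[-1]                       # at n = T, smoothing and filtering coincide
    for n in range(T - 2, -1, -1):
        pred = filt[n] @ A                      # p(z_{n+1} | x_1..x_n)
        ratio = smooth[n + 1] / pred            # p(z_{n+1} | x_1..T) / p(z_{n+1} | x_1..n)
        smooth[n] = filt[n] * (A @ ratio)       # sum over z_{n+1} of p(z_{n+1} | z_n) * ratio
    return smooth
```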
Edit 2:
The book at https://www.springer.com/us/book/9780387402642 defines on page 63 the normalized backward function as
$$ \hat{\beta}(z_n) = \beta(z_n)\frac{p(x_{1:n})}{p(x_{1:T})} = \frac{p(z_n \mid x_{1:T}) }{p(z_n \mid x_{1:n})}, $$ which, as they mention on page 65, has the interpretation of the ratio of two marginal smoothing distributions.
Best Answer
The Baum-Welch algorithm and the Viterbi algorithm calculate different things.
If you know the transition probabilities for the hidden part of your model, and the emission probabilities for the visible outputs of your model, then the Viterbi algorithm gives you the most likely complete sequence of hidden states conditional on both your outputs and your model specification.
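For concreteness, a minimal log-space Viterbi decoder for known parameters might look like the following sketch (the names and matrix conventions are my own assumptions, matching the forward-pass sketch earlier on this page, not the implementation of any particular package):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely complete hidden-state path, given known parameters.

    Runs in log space to sidestep the underflow issue discussed above.
    pi, A, B follow the same conventions as the forward-pass sketch.
    """
    T, K = len(obs), len(pi)
    logdelta = np.log(pi) + np.log(B[:, obs[0]])
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best log-probability of a path ending in state i, then moving to j
        scores = logdelta[:, None] + np.log(A)
        backptr[t] = scores.argmax(axis=0)
        logdelta = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Trace back from the best final state.
    path = np.zeros(T, dtype=int)
    path[-1] = logdelta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path
```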
The Baum-Welch algorithm gives you both the most likely hidden transition probabilities and the most likely set of emission probabilities, given only the observed outputs of the model (and, usually, an upper bound on the number of hidden states). You also get the "pointwise" most likely hidden states, i.e. the most likely state at each time point separately, which is often slightly different from the single hidden sequence that is most likely overall.
If you know your model and just want the latent states, then there is no reason to use the Baum-Welch algorithm. If you don't know your model, then you can't be using the Viterbi algorithm.
Edited to add: See Peter Smit's comment; there's some overlap/vagueness in nomenclature. Some poking around led me to a chapter by Luis Javier Rodríguez and Inés Torres in "Pattern Recognition and Image Analysis" (ISBN 978-3-540-40217-6, pp. 845-857) which discusses the speed versus accuracy trade-offs of the two algorithms.
Briefly, the Baum-Welch algorithm is essentially the Expectation-Maximization (EM) algorithm applied to an HMM; as a strict EM-type algorithm you're guaranteed to converge to at least a local maximum, and so for unimodal problems to find the MLE. It requires two passes over your data for each iteration, though, and the cost grows quickly with the length of the data and the number of training samples. However, you do end up with the full conditional likelihood for your hidden parameters.
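As an illustration only (not the chapter's implementation), one Baum-Welch iteration can be sketched in terms of the filtering and smoothing quantities from the earlier answer, assuming the `forward_filter` and `marginal_smoother` sketches above are in scope:

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One EM iteration of Baum-Welch, phrased with filtering/smoothing quantities.

    Relies on the forward_filter and marginal_smoother sketches defined earlier.
    """
    T = len(obs)
    K, M = B.shape
    # E-step: forward (filtering) and backward (smoothing) passes over the data.
    filt, _ = forward_filter(obs, pi, A, B)
    smooth = marginal_smoother(filt, A)
    # Pairwise posteriors xi[n, i, j] = p(z_n = i, z_{n+1} = j | x_1..x_T).
    xi = np.zeros((T - 1, K, K))
    for n in range(T - 1):
        pred = filt[n] @ A                      # p(z_{n+1} | x_1..x_n)
        xi[n] = (filt[n][:, None] * A) * (smooth[n + 1] / pred)[None, :]
    # M-step: re-estimate parameters from the expected counts.
    pi_new = smooth[0]
    A_new = xi.sum(axis=0)
    A_new /= A_new.sum(axis=1, keepdims=True)
    B_new = np.zeros((K, M))
    for n, x in enumerate(obs):
        B_new[:, x] += smooth[n]
    B_new /= B_new.sum(axis=1, keepdims=True)
    return pi_new, A_new, B_new
```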
The Viterbi training algorithm (as opposed to the "Viterbi algorithm") approximates the MLE to achieve a gain in speed at the cost of accuracy. It segments the data and then applies the Viterbi algorithm (as I understood it) to get the most likely state sequence in the segment, then uses that most likely state sequence to re-estimate the hidden parameters. This, unlike the Baum-Welch algorithm, doesn't give the full conditional likelihood of the hidden parameters, and so ends up reducing the accuracy while saving significant (the chapter reports 1 to 2 orders of magnitude) computational time.
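By contrast, Viterbi training can be sketched as the following hard-decoding loop, reusing the `viterbi` sketch above; the pseudo-counts and the fixed iteration count are my own simplifications:

```python
import numpy as np

def viterbi_training(obs, pi, A, B, n_iter=20):
    """Viterbi (hard-EM) training: decode one best path, then count along it.

    Contrast with Baum-Welch, which weights every state by its full posterior.
    Reuses the viterbi sketch above; pi is kept fixed here for brevity.
    """
    K, M = B.shape
    for _ in range(n_iter):
        path = viterbi(obs, pi, A, B)                 # decode with current parameters
        trans = np.full((K, K), 1e-6)                 # tiny pseudo-counts keep rows normalizable
        emit = np.full((K, M), 1e-6)
        for t in range(len(obs) - 1):
            trans[path[t], path[t + 1]] += 1
        for s, x in zip(path, obs):
            emit[s, x] += 1
        A = trans / trans.sum(axis=1, keepdims=True)  # re-estimate from hard counts
        B = emit / emit.sum(axis=1, keepdims=True)
    return A, B
```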