Solved – Simple Explanation of Baum-Welch/Viterbi

baum-welch, expectation-maximization, hidden-markov-model, machine-learning, optimization

I'm looking for as simple an explanation as possible of Baum-Welch and Viterbi for HMMs, with a straightforward, well-annotated example. Almost all of the explanations I find on the net invariably jump from the beginning into an almost completely jargon/symbol-based explanation. The few times they have an example, it is usually shoved to the side or poorly annotated (i.e. it's unclear how the symbols and the example relate to each other). I've also heard conflicting claims: some sources say Baum-Welch is based on the EM algorithm, others say it's based on the forward-backward algorithm, and some say both, so I'd like to get details like that cleared up.

The closest I found so far is a 1 hr video posted here that people said was great but is unplayable because Flash is no longer supported, and an explanation on Wikipedia that has a note next to it saying it's wrong.

Best Answer

I have the perfect paper for you: Rabiner's classic tutorial, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition": http://www.stat.columbia.edu/~liam/teaching/neurostat-fall17/papers/hmm/rabiner.pdf

Note that, AFAIK, there are newer variants of Baum-Welch. Also, to clear up the point you asked about: the claims you've heard are not contradictory. Baum-Welch is an instance of the EM algorithm specialized to HMMs, and it uses the forward-backward algorithm inside its E-step to compute the expected state occupancies and transitions; the M-step then re-estimates the parameters from those expectations. So "both" is the right answer.
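To make that E-step/M-step structure concrete, here is a minimal sketch of a single Baum-Welch iteration in Python/NumPy. It is a sketch, not a reference implementation: the function name `baum_welch_step` and the toy two-state HMM values below are made up for illustration. See Rabiner's tutorial for the actual derivation and notation.

```python
import numpy as np

def baum_welch_step(init, trans, emit, obs):
    """One EM iteration for an HMM (hypothetical helper, for illustration).

    init:  initial state probabilities, shape (S,)
    trans: trans[i, j] = P(state j at t | state i at t-1), shape (S, S)
    emit:  emit[i, o] = P(observing symbol o | state i), shape (S, O)
    obs:   observed symbol indices, length T
    """
    obs = np.asarray(obs)
    S, T = len(init), len(obs)

    # E-step, forward pass: alpha[t, s] = P(obs[0..t], state s at t)
    alpha = np.zeros((T, S))
    alpha[0] = init * emit[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]

    # E-step, backward pass: beta[t, s] = P(obs[t+1..T-1] | state s at t)
    beta = np.ones((T, S))
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])

    # Expected state occupancies (gamma) and expected transition counts (xi)
    evidence = alpha[-1].sum()       # P(obs)
    gamma = alpha * beta / evidence  # gamma[t, s] = P(state s at t | obs)
    xi = np.zeros((S, S))
    for t in range(T - 1):
        xi += np.outer(alpha[t], emit[:, obs[t + 1]] * beta[t + 1]) * trans / evidence

    # M-step: re-estimate the parameters from the expected counts
    new_init = gamma[0]
    new_trans = xi / gamma[:-1].sum(axis=0, keepdims=True).T
    new_emit = np.zeros_like(emit)
    for o in range(emit.shape[1]):
        new_emit[:, o] = gamma[obs == o].sum(axis=0) / gamma.sum(axis=0)
    return new_init, new_trans, new_emit

# Hypothetical toy HMM: 2 states, 2 observation symbols
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.1, 0.9], [0.6, 0.4]])
for _ in range(10):  # in practice: iterate until the likelihood stops improving
    init, trans, emit = baum_welch_step(init, trans, emit, [0, 1, 1, 0])
```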

Attempted simple explanation of Viterbi: the goal of Viterbi is to find an optimal sequence of states over some discrete period of time.
Start off with the initial probabilities of being in each state $s$ at time $t=0$. Then increase $t$, and for each state at that time ($t=1$), find the best state $s$ at $t-1$ (best as in the one giving the sequence with the highest probability; at each step the probability of the current observation in that state is multiplied in as well). Save this information: that means for each state $s$ at $t=1$, save which state $s$ at $t=0$ is best, along with the probability of that (optimal) sequence.
Then increase $t$ to $t=2$ and repeat the procedure (with the probabilities for each state $s$ at $t-1$ now being the just-calculated best probabilities for that state). At some point you'll reach $t=T$ (the last point in time). Then you just pick the state with the highest probability, and since for each state at each time you have saved the best state at $t-1$, you take the best state at $t=T$ and work your way backwards, always picking the saved optimal state.
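Here is what those steps look like as code. Again a minimal sketch: the `viterbi` function name and the two-state toy HMM (the same hypothetical values as in the Baum-Welch sketch above) are made up, but the loop structure follows the description above step for step.

```python
import numpy as np

def viterbi(init, trans, emit, obs):
    """Return the most likely state sequence for the observations."""
    S, T = len(init), len(obs)
    prob = np.zeros((T, S))       # prob[t, s]: probability of the best sequence ending in s at t
    back = np.zeros((T, S), int)  # back[t, s]: saved best predecessor of state s at time t

    # t = 0: initial probabilities, weighted by the first observation
    prob[0] = init * emit[:, obs[0]]

    # t = 1 .. T-1: for each state, find and save the best state at t-1
    for t in range(1, T):
        for s in range(S):
            cand = prob[t - 1] * trans[:, s] * emit[s, obs[t]]
            back[t, s] = np.argmax(cand)
            prob[t, s] = cand[back[t, s]]

    # At t = T-1, pick the best state, then work backwards via the saved pointers
    path = [int(np.argmax(prob[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical toy HMM: states (0, 1), observation symbols (0, 1)
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.1, 0.9], [0.6, 0.4]])
print(viterbi(init, trans, emit, [0, 1, 1]))  # [1, 0, 0]
```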

To elaborate on what it does at a more general level: if you have a set of states $S$, each of which can be occupied at every step over some duration of time $T$, a naive way to find the best sequence would be to enumerate all possible sequences and then calculate the probability of each (that means calculating the product of the state and transition probabilities along the sequence). The inefficiency here is that a lot of the calculations are duplicated. As an example, you could have two sequences $p_1, p_2$ (where $p$ is a sequence of states) whose only difference is the final state, yet with this approach you calculate the probabilities of the shared states twice, once for $P(p_1)$ and once for $P(p_2)$.
By taking the iterative approach used in Viterbi, there are no duplicated calculations, so it is far more efficient: roughly $|S|^2 \cdot T$ operations instead of scoring all $|S|^T$ possible sequences.
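For comparison, here is a sketch of that naive enumeration (same hypothetical toy HMM as in the Viterbi sketch above). It scores all $|S|^T$ sequences from scratch, recomputing the shared prefixes over and over, and ends up with the same best path that `viterbi` finds.

```python
from itertools import product
import numpy as np

def brute_force(init, trans, emit, obs):
    """Score every possible state sequence from scratch (illustration only)."""
    best_prob, best_path = -1.0, None
    for path in product(range(len(init)), repeat=len(obs)):  # S**T candidates
        p = init[path[0]] * emit[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1], path[t]] * emit[path[t], obs[t]]
        if p > best_prob:
            best_prob, best_path = p, list(path)
    return best_path

init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.1, 0.9], [0.6, 0.4]])
print(brute_force(init, trans, emit, [0, 1, 1]))  # [1, 0, 0], same as viterbi()
```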