Solved – Scaling step in Baum-Welch algorithm

baum-welch, hidden-markov-model, optimal-scaling

I am implementing the Baum-Welch algorithm for training a hidden Markov model, mainly to better understand the training process.

I have implemented the iterative procedures described in Rabiner's classic paper; my implementation is in Wolfram Mathematica. The problem I am facing is the scaling step: if the sequence is 100 observations long, the probabilities easily drop below $10^{-30}$.

My question is: can anyone explain, or provide a link to, how the scaled forward and backward matrices are calculated?

For the forward procedure the code is as follows:

    ForwardProcedure[TM_, EM_, P_, N_, M_, Seq_] :=
     Module[{DP, i, j, DPPart},
      (* initialization: DP[[i, 1]] = P[[i]] * EM[[i, first observation]];
         M (the alphabet size) is not used below *)
      DP = Table[
        If[j == 1, P[[i]]*EM[[i, Part[Seq, 1]]], 0]
        , {i, 1, N}, {j, 1, Length@Seq}];
      (* recursion: DP[[i, j]] = (sum over k of DP[[k, j - 1]] TM[[k, i]]) * EM[[i, Seq[[j]]]] *)
      For[j = 2, j <= Length@Seq, j++,
       DPPart = DP[[All, j - 1]];
       For[i = 1, i <= N, i++,
        DP[[i, j]] = Dot[DPPart, TM[[All, i]]]*EM[[i, Part[Seq, j]]];
        ];
       ];
      DP
      ]

Essentially, the line DP[[i, j]] = Dot[DPPart, TM[[All, i]]]*EM[[i, Part[Seq, j]]] is the update: it takes the needed column of the transition matrix (TM) and the previous column of the alpha matrix (DP), and multiplies their dot product by the associated emission probability.
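For concreteness, here is a toy call (all the model values below are made up) that shows how quickly the unscaled alpha values shrink:

    (* hypothetical 2-state, 2-symbol model *)
    tm = {{0.7, 0.3}, {0.4, 0.6}};      (* transition matrix, rows sum to 1 *)
    em = {{0.9, 0.1}, {0.2, 0.8}};      (* emission matrix *)
    p = {0.5, 0.5};                     (* initial state distribution *)
    seq = RandomInteger[{1, 2}, 100];   (* a 100-symbol observation sequence *)
    alpha = ForwardProcedure[tm, em, p, 2, 2, seq];
    Total[alpha[[All, -1]]]             (* p(Seq): typically around 10^-30 or smaller *)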

How can I have a scaling factor that is independent of $i$ when not all time steps $t$ of a row of the DP matrix have been filled yet?

Best Answer

Background and Notation

Let $x_1, \ldots, x_T$ be the observations, and $z_1, \ldots, z_T$ be the hidden states. The forward-backward recursions are written in terms of $\alpha(z_n)$ and $\beta(z_n)$.

$\alpha(z_n) = p(x_1,\ldots,x_n,z_n)$ is the joint probability of all the heretofore observed data and the most recent state. As $n$ gets large, this gets very small, simply because of a basic property of probabilities: the probability of a more and more restrictive event can only get smaller.

$\beta(z_n) = p(x_{n+1},\ldots,x_T|z_n)$ is the probability of all the future data given the current state. This gets small as $n$ goes down (i.e., as the number of remaining observations $T-n$ grows), for the same reason as above.

Forward Algorithm

You can see that both of these things become very small, and hence cause numerical issues, whenever you have a non-tiny dataset. So instead of using the forward recursion $$ \alpha(z_n) = p(x_n|z_n) \sum_{z_{n-1}}\alpha(z_{n-1})p(z_n|z_{n-1}) $$ use the filtering recursion: $$ p(z_n|x_1,\ldots,x_n) = \frac{p(x_n|z_n)\sum_{z_{n-1}} p(z_n|z_{n-1})p(z_{n-1}|x_1,\ldots,x_{n-1})}{p(x_n|x_1,\ldots,x_{n-1})}. $$

You can get this algebraically by dividing both sides of the first recursion by $p(x_1,\ldots,x_n)$, but I would program it by keeping track of the filtering distributions rather than the $\alpha$ quantities. You also don't usually need to worry about the normalizing constant in the denominator: at each step you can just normalize your probability vector (which is cheap as long as your state space isn't too large), and the normalizer you compute is exactly $p(x_n|x_1,\ldots,x_{n-1})$.
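In the asker's Mathematica setup, a minimal sketch of this filtering recursion could look like the following (the function name ScaledForwardProcedure and the lowercase n are my own; the argument conventions are assumed to match ForwardProcedure above, with rows of TM summing to 1 and EM[[i, k]] being the probability of emitting symbol k in state i):

    (* column j of F holds p(z_j | x_1..x_j); c[[j]] = p(x_j | x_1..x_{j-1}) *)
    ScaledForwardProcedure[TM_, EM_, P_, n_, Seq_] :=
     Module[{F, c, col, j},
      F = ConstantArray[0., {n, Length@Seq}];
      c = ConstantArray[0., Length@Seq];
      col = Table[P[[i]]*EM[[i, Part[Seq, 1]]], {i, 1, n}];
      c[[1]] = Total[col];
      F[[All, 1]] = col/c[[1]];
      For[j = 2, j <= Length@Seq, j++,
       col = Table[
         Dot[F[[All, j - 1]], TM[[All, i]]]*EM[[i, Part[Seq, j]]], {i, 1, n}];
       c[[j]] = Total[col];        (* the scaling factor, independent of i *)
       F[[All, j]] = col/c[[j]];   (* normalized column = filtering distribution *)
       ];
      {F, c}
      ]

The scaling factor is computed only after the whole column for time $j$ has been filled, which addresses the question above: you normalize per column, not per entry, and $\sum_j \log c_j$ recovers the log-likelihood $\log p(x_1,\ldots,x_T)$ without underflow.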

I am more familiar with the non-HMM state space model literature, and the filtering recursions are probably the most common set of recursions. I have found a few results on the internet where HMM people call this procedure "scaling." So it's nice to find this connection and figure out what all this $\alpha$ stuff is.

Backward Algorithm

Next, the traditional HMM backward algorithm is stated as $$ \beta(z_n) = \sum_{z_{n+1}}\beta(z_{n+1})p(x_{n+1}|z_{n+1})p(z_{n+1}|z_n). $$ This one is trickier, though. I have found a few websites that say to divide both sides by $p(x_1,\ldots,x_n)$ again, and I have also found resources that say to divide both sides by $p(x_{n+1},\ldots,x_T|x_1,\ldots,x_n)$. I'm still working this out, though. Also, I suspect there's another connection between this and the backward recursions in other areas of the state space model literature (for marginal and joint smoothing distributions). I'll keep you posted.

Edit:

It turns out that if you multiply both sides of the HMM backward equation above by $p(x_1, \ldots, x_n,z_n)/p(x_1,\ldots,x_T)$, you get the backward recursion for the marginal smoothing distributions, which is more common in non-HMM state space models.

$$ p(z_n|x_1,\ldots,x_T) = p(z_n|x_1,\ldots,x_n) \sum_{z_{n+1}}\frac{p(z_{n+1}|z_n)}{p(z_{n+1}|x_1,\ldots,x_n)}p(z_{n+1}|x_1,\ldots,x_T). $$
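To see why the left-hand side comes out this way: given $z_n$, the future observations are independent of the past ones, so $$ \beta(z_n)\,\frac{p(x_1,\ldots,x_n,z_n)}{p(x_1,\ldots,x_T)} = \frac{p(x_{n+1},\ldots,x_T|z_n)\,p(x_1,\ldots,x_n,z_n)}{p(x_1,\ldots,x_T)} = \frac{p(x_1,\ldots,x_T,z_n)}{p(x_1,\ldots,x_T)} = p(z_n|x_1,\ldots,x_T). $$ Applying the same identity at $n+1$ to each term of the sum, together with $p(x_1,\ldots,x_{n+1},z_{n+1}) = p(x_{n+1}|z_{n+1})\,p(z_{n+1}|x_1,\ldots,x_n)\,p(x_1,\ldots,x_n)$, gives the right-hand side.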

I don't know if this is standard for using Baum-Welch, but it's pretty standard with other state space models. Cool.

Edit 2:

The book at https://www.springer.com/us/book/9780387402642 defines, on page 63, the normalized backward function as

$$ \hat{\beta}(z_n) = \beta(z_n)\frac{p(x_{1:n})}{p(x_{1:T})} = \frac{p(z_n \mid x_{1:T}) }{p(z_n \mid x_{1:n})}, $$ which, as they mention on page 65, has the interpretation of the ratio of two marginal smoothing distributions.
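Continuing the Mathematica sketch from the forward section (again, ScaledBackwardProcedure and its argument names are mine, and the vector c is the one returned by ScaledForwardProcedure above), this $\hat{\beta}$ satisfies $\hat{\beta}(z_T)=1$ and $\hat{\beta}(z_n) = \frac{1}{p(x_{n+1}|x_{1:n})}\sum_{z_{n+1}}\hat{\beta}(z_{n+1})p(x_{n+1}|z_{n+1})p(z_{n+1}|z_n)$, so the backward pass can reuse the scaling factors from the forward pass:

    (* column j holds the normalized backward function beta-hat above;
       c is the vector of scaling factors from ScaledForwardProcedure *)
    ScaledBackwardProcedure[TM_, EM_, Seq_, c_, n_] :=
     Module[{B, j},
      B = ConstantArray[0., {n, Length@Seq}];
      B[[All, -1]] = ConstantArray[1., n];   (* beta-hat(z_T) = 1 *)
      For[j = Length@Seq - 1, j >= 1, j--,
       B[[All, j]] = Table[
          Sum[TM[[i, k]]*EM[[k, Part[Seq, j + 1]]]*B[[k, j + 1]], {k, 1, n}],
          {i, 1, n}]/c[[j + 1]];
       ];
      B
      ]

With {F, c} from the forward pass, the elementwise product F*B then gives, column by column, the marginal smoothing distributions $p(z_n \mid x_{1:T})$ used in the Baum-Welch re-estimation step, and everything stays in a numerically safe range.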
