I have implemented Viterbi and Forward algorithm, alas strangely I can't understand how does Backward algorithm work. Intuitively I feel like I need to do the same thing as in Forward only backwards, using the values calculated during Forward propagation.
Is my intuition correct?
I have read a lot of slides and sick of math notation at this point. It does not help. I need something that in plain English explains the difference between Backward and Forward algorithms.
Can you provide a short explanation how is Backward algorithm done?
Assume the following small HMM and the results of Forward algorithm for "BB" sequence below:
START -> 1
H: 0.5 * 0.8 = 0.4
L: 0.5 * 0.6 = 0.3
1 -> 2
H: 0.4 * 0.2 * 0.8 + 0.3 * 0.6 * 0.8 = 0.208
L: 0.4 * 0.8 * 0.6 + 0.3 * 0.4 * 0.6 = 0.264
2 -> END
END: 0.208 * 0.3 + 0.264 * 0.7 = 0.2472
Best Answer
Looks like your observation sequence is B,B. Let's denote the observation at time $t$ as $z_t$ and hidden state at time $t$ as $x_t$. If we denote $\alpha_t(i)$ as the forward values and $\beta_t(i)$ as the backward values, ($i$ is one of the possible hidden states)
$\alpha_t(i)=P(x_t=i,z_{1:t})$
This means $\alpha_t(i)$ is the probability of arriving to state $i$ at time $t$ emitting the observations up to time $t$. Then,
$\beta_t(i) = P(z_{t+1:T}\mid x_t=i)$ which is the probability of emitting the remaining sequence from $t+1$ until the end of time after being at hidden state $i$ at time $t$.
To do the recursion on $\beta_t(i)$ we can write,
$P(z_{t+1:T}\mid x_t=i)=\sum\limits_jP(x_{t+1}=j,z_{t+1:T}\mid x_{t}=i)$
Using chain rule, $P(x_{t+1}=j,z_{t+1:T}\mid x_{t}=i) = P(z_{t+2:T},z_{t+1},x_{t+1}=j\mid x_{t}=i)\\ =P(z_{t+2:T}\mid z_{t+1},x_{t+1}=j, x_{t}=i)P(z_{t+1}\mid x_{t+1}=j, x_{t}=i)P(x_{t+1}=j\mid x_{t}=i)$
From conditional independencies of HMM the above probabilities simplifies to $P(z_{t+2:T}\mid x_{t+1}=j)P(z_{t+1}\mid x_{t+1}=j)P(x_{t+1}=j\mid x_{t}=i)$
Note that $P(z_{t+2:T}\mid x_{t+1}=j) =\beta_{t+1}(j) $ from our definition.
Substituting to $P(z_{t+1:T}\mid x_t=i)$ we get,
$\beta_t(i) = P(z_{t+1:T}\mid x_t=i) = \sum\limits_j \beta_{t+1}(j)P(z_{t+1}\mid x_{t+1}=j)P(x_{t+1}=j\mid x_t=i)$
Now you have a recursion for beta. Last two terms of the last equation you know from your model. Here starting from end of the chain (T) we go backward calculating all $\beta_t$, hence the backward algorithm. In forward you have to start from the beginning and you go to the end of chain.
In your model you have to initialize $\beta_T(i) = P(\emptyset \mid x_T=i)=1$ for all $i$. This is the probability of not emitting observations after $T=2$.