Hidden Markov Models – Definition of Probability in Forward-Backward Method

Tags: conditional-probability, forward-backward, hidden-markov-model, joint-distribution

I am confused about the way that the Jurafsky and Martin book (Appendix A, page 6) explains the relationship between the observations and hidden states:

Each cell of the forward algorithm trellis $\alpha_t(j)$ represents the
probability of being in state $j$ after seeing the first $t$
observations, given the automaton $\lambda$

Then they define $\alpha_t(j)$ as $P(o_1,o_2,\dots,o_t,q_t = j|\lambda)$.

I am not sure why the quoted text refers to the joint probability of the observations and the state at time $t$. I would rather define it as $\alpha_t(j) = P(q_t = j|o_1,o_2,\dots,o_t,\lambda)$. The conditional distribution seems more natural to me here: we first see the first $t$ observations, and then we ask for the probability of being in state $j$.

A separate question: Although I am not convinced that the joint distribution is what we should use here, I think they marginalized over all previous states:

\begin{align}
\alpha_t(j) &= \sum_{q_1,\dots,q_{t-1}}P(q_1, q_2,\dots,q_t=j,o_1,o_2,\dots,o_t|\lambda) \\
&=P(o_1,o_2,\dots,o_t,q_t = j|\lambda)
\end{align}

Is this correct?

Best Answer

There are two questions here; I'll give the short answers, then elaborate.

  1. Should $\alpha_t$ be defined as the joint distribution of the state and observations, or as the conditional distribution of the state given the observations? It should be the joint.
  2. Are the two formulas you wrote equivalent? Yes, by the law of total probability.

You may prefer to define $\alpha_t$ conditionally, but the J&M book is right. In the forward algorithm, you marginalize over all possible previous states to carry the computation forward one time step: the recursion is $\alpha_t(j) = \sum_i \alpha_{t-1}(i)\,a_{ij}\,b_j(o_t)$. (That's your separate question, but it's relevant here.) Your conditional distribution can be defined in terms of the joint distribution:
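
To make the marginalization concrete, here is a minimal numpy sketch. The model (`pi`, `A`, `B`) and observation sequence are made-up numbers, purely for illustration, and `forward` and `brute_force` are hypothetical helper names. It runs the forward recursion and then checks, by brute-force enumeration of all state paths, that $\alpha_t(j)$ equals the sum over $q_1,\dots,q_{t-1}$ of the joint path probabilities, which is exactly the identity in the question:

```python
import itertools
import numpy as np

# A small hypothetical HMM (numbers made up for illustration):
# pi[i]   = P(q_1 = i)                 initial distribution
# A[i, j] = P(q_{t+1} = j | q_t = i)   transition matrix
# B[i, k] = P(o_t = k | q_t = i)       emission matrix
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
obs = [0, 1, 1]  # observed symbols o_1, o_2, o_3

def forward(pi, A, B, obs):
    """Forward algorithm; row t-1 (0-indexed) holds alpha_t(j) = P(o_1..o_t, q_t = j | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # initialization: alpha_1(j) = pi_j b_j(o_1)
    for t in range(1, T):
        # Marginalize over the previous state i: sum_i alpha_{t-1}(i) a_{ij}, then emit o_t.
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def brute_force(pi, A, B, obs, j):
    """Sum the joint probability of every path q_1..q_T with q_T = j fixed."""
    T, N = len(obs), len(pi)
    total = 0.0
    for path in itertools.product(range(N), repeat=T - 1):
        states = list(path) + [j]
        p = pi[states[0]] * B[states[0], obs[0]]
        for t in range(1, T):
            p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
        total += p
    return total

alpha = forward(pi, A, B, obs)
for j in range(len(pi)):
    assert np.isclose(alpha[-1, j], brute_force(pi, A, B, obs, j))
print(alpha[-1])  # joint probabilities P(o_1..o_T, q_T = j | lambda)
```

The assertion passing is the law of total probability at work: the forward recursion computes the same marginalization over $q_1,\dots,q_{t-1}$, just incrementally, so it costs $O(TN^2)$ instead of the $O(N^T)$ of the brute-force enumeration.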

$$ P(q_t = j|o_1,o_2,\dots,o_t,\lambda) = \frac{ P(o_1,o_2,\dots,o_t,q_t = j|\lambda) }{ P(o_1,o_2,\dots,o_t|\lambda) } $$

But then, how do you compute the denominator? It is precisely the quantity the forward algorithm computes: the likelihood of the observations, obtained by marginalizing the joint distribution over all hidden states. Defining $\alpha_t$ as the conditional would therefore be circular. Instead, we work with the joint distribution and recover the conditional probabilities from it, and the forward algorithm lets us do this efficiently.
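
To close the loop on that point, here is a self-contained sketch (same made-up model as above): after the forward pass, the denominator $P(o_1,\dots,o_T|\lambda)$ is simply $\sum_j \alpha_T(j)$, and the conditional distribution the question asked about is the joint divided by that sum.

```python
import numpy as np

# Same hypothetical model as above (numbers made up for illustration).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
obs = [0, 1, 1]

# Forward pass: alpha[j] = P(o_1..o_t, q_t = j | lambda) at the current time step.
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

# The denominator P(o_1..o_T | lambda) is the sum of the joint values...
likelihood = alpha.sum()
# ...and the conditional is just the normalized joint.
posterior = alpha / likelihood   # P(q_T = j | o_1..o_T, lambda)
print(likelihood, posterior)
```

So the conditional is a one-line normalization once the joint $\alpha$ values exist, which is why the joint is the natural quantity to define and propagate.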
