Training a Hidden Markov Model under the maximum-likelihood criterion has two cases:
- Observed data: When your training data are fully observed and there are no hidden variables, all you need to do is count the frequencies and form the three matrices: $\pi$, $A$, and $B$.
- Unobserved data: When there are hidden variables, some free parameters cannot be found by counting. In that case, Expectation-Maximization (EM), known in the HMM setting as the Baum-Welch algorithm, is used. In each training iteration, Baum-Welch performs two steps over all instances: first compute the expectations over the training instances (E-step), then maximize the likelihood with respect to them (M-step).
So, the answer to your question is that in every training iteration, all instances are involved. Read this great tutorial on HMMs; I'm sure it will help you.
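As a sketch of the observed-data case, counting and normalizing might look like this in Python (the training sequences and the state/symbol names are made up for illustration):

```python
from collections import Counter

# Hypothetical fully observed training data: each sequence is a list of
# (hidden_state, observation) pairs.
sequences = [
    [("A", "L"), ("A", "L"), ("B", "R")],
    [("B", "R"), ("A", "L"), ("B", "R")],
]

states = ["A", "B"]
symbols = ["L", "R"]

start = Counter()   # counts for pi (initial-state distribution)
trans = Counter()   # counts for A (state -> state transitions)
emit = Counter()    # counts for B (state -> observation emissions)

for seq in sequences:
    start[seq[0][0]] += 1
    for (s, z) in seq:
        emit[(s, z)] += 1
    for (s, _), (s_next, _) in zip(seq, seq[1:]):
        trans[(s, s_next)] += 1

# Normalize counts into probabilities.
# (Real code would smooth or guard against zero counts.)
pi = {s: start[s] / len(sequences) for s in states}
A = {s: {t: trans[(s, t)] / sum(trans[(s, u)] for u in states)
         for t in states} for s in states}
B = {s: {z: emit[(s, z)] / sum(emit[(s, w)] for w in symbols)
         for z in symbols} for s in states}
```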
Looks like your observation sequence is B,B.
Let's denote the observation at time $t$ as $z_t$ and the hidden state at time $t$ as $x_t$. Let $\alpha_t(i)$ denote the forward values and $\beta_t(i)$ the backward values, where $i$ is one of the possible hidden states:
$\alpha_t(i)=P(x_t=i,z_{1:t})$
This means $\alpha_t(i)$ is the probability of arriving at state $i$ at time $t$ while emitting the observations up to time $t$.
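For reference, the forward values obey an analogous recursion, which follows from the same conditional independencies used for $\beta$ below:

$\alpha_{t+1}(j) = P(z_{t+1}\mid x_{t+1}=j)\sum\limits_i \alpha_t(i)P(x_{t+1}=j\mid x_t=i)$

with initialization $\alpha_1(i) = P(x_1=i)P(z_1\mid x_1=i)$.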
Then,
$\beta_t(i) = P(z_{t+1:T}\mid x_t=i)$, which is the probability of emitting the remaining sequence from $t+1$ to the end, given that the chain is at hidden state $i$ at time $t$.
To do the recursion on $\beta_t(i)$ we can write,
$P(z_{t+1:T}\mid x_t=i)=\sum\limits_jP(x_{t+1}=j,z_{t+1:T}\mid x_{t}=i)$
Using the chain rule,
$P(x_{t+1}=j,z_{t+1:T}\mid x_{t}=i) = P(z_{t+2:T},z_{t+1},x_{t+1}=j\mid x_{t}=i)\\
=P(z_{t+2:T}\mid z_{t+1},x_{t+1}=j, x_{t}=i)P(z_{t+1}\mid x_{t+1}=j, x_{t}=i)P(x_{t+1}=j\mid x_{t}=i)$
From the conditional independencies of the HMM, the above probability simplifies to
$P(z_{t+2:T}\mid x_{t+1}=j)P(z_{t+1}\mid x_{t+1}=j)P(x_{t+1}=j\mid x_{t}=i)$
Note that $P(z_{t+2:T}\mid x_{t+1}=j) =\beta_{t+1}(j) $ from our definition.
Substituting into $P(z_{t+1:T}\mid x_t=i)$, we get
$\beta_t(i) = P(z_{t+1:T}\mid x_t=i) = \sum\limits_j \beta_{t+1}(j)P(z_{t+1}\mid x_{t+1}=j)P(x_{t+1}=j\mid x_t=i)$
Now you have a recursion for $\beta$. The last two terms of the last equation are known from your model. Here, starting from the end of the chain ($T$), we go backward calculating all $\beta_t$, hence the name backward algorithm. In the forward algorithm you start from the beginning and go to the end of the chain.
In your model you have to initialize $\beta_T(i) = P(\emptyset \mid x_T=i)=1$ for all $i$.
This is the probability of emitting no further observations after the last time step, $T=2$ in your example.
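As a minimal sketch, the recursion above can be implemented in a few lines of Python (the two-state transition and emission matrices and the two-step observation sequence here are illustrative, not taken from your model):

```python
import numpy as np

# Illustrative two-state model.
A = np.array([[0.7, 0.3],     # A[i, j] = P(x_{t+1}=j | x_t=i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],     # B[i, z] = P(z | x_t=i)
              [0.2, 0.8]])

obs = [1, 1]                  # a length-2 observation sequence (symbol indices)
T = len(obs)
n_states = A.shape[0]

beta = np.ones((T, n_states))  # initialize beta_T(i) = 1 for all i
for t in range(T - 2, -1, -1):
    # beta_t(i) = sum_j beta_{t+1}(j) P(z_{t+1} | x_{t+1}=j) P(x_{t+1}=j | x_t=i)
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

print(beta)
```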
Let $\mathbf{X}$ be an observation sequence and $\lambda$ be a Hidden Markov Model (HMM). Then the forward algorithm determines $\textrm{Pr}(\mathbf{X}|\lambda)$, the likelihood of realizing sequence $\mathbf{X}$ from HMM $\lambda$.
In more plain English terms... Let's say you trained up your HMM $\lambda$ and you'd like to see how likely it is that $\lambda$ produced some sequence $\mathbf{X}$. The forward algorithm would do this for you. If you get a relatively high likelihood, there's a good chance that $\lambda$ produced $\mathbf{X}$. If you had two HMMs $\lambda_1$ and $\lambda_2$, you might conclude the one with the higher likelihood is the best model to explain your $\mathbf{X}$.
Let's look at some R code... First we'll initialize an HMM. (I got this from the HMM manual on CRAN: http://cran.r-project.org/web/packages/HMM/HMM.pdf) ...
This is an HMM that has an 80% chance of staying in whatever hidden state it was in at time $t$ when it transitions to time $t+1$. It has two hidden states, A and B, and it emits two observations, L and R. The emission probabilities are contained in `emissionProbs`. We store the observation sequence $\mathbf{X}$ in `observations`. `forwardProbabilities` is now a matrix containing the probability of realizing the observation sequence up to time $t$ and ending up in state $i$. Hidden states $i$ are in rows and times $t$ are in columns. If we want the probability of the overall sequence, we use column 4 because it is the last timestep. The resulting 2-element row vector contains the probabilities of ending in A and ending in B. Their sum is the total probability of realizing $\mathbf{X}$.
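The same computation can be sketched in Python. The 80% self-transition and the A/B states and L/R symbols follow the description above, but the start probabilities, emission probabilities, and the 4-step observation sequence are made up, since the original R listing is not shown:

```python
import numpy as np

states = ["A", "B"]
symbols = ["L", "R"]
pi = np.array([0.5, 0.5])       # illustrative start probabilities
A = np.array([[0.8, 0.2],       # 80% chance of staying in the same hidden state
              [0.2, 0.8]])
E = np.array([[0.9, 0.1],       # illustrative emission probabilities
              [0.1, 0.9]])

# An illustrative observation sequence X with 4 timesteps.
obs = [symbols.index(s) for s in ["L", "L", "R", "R"]]

# forward[i, t] = P(z_{1:t}, x_t = i): states in rows, times in columns
forward = np.zeros((len(states), len(obs)))
forward[:, 0] = pi * E[:, obs[0]]
for t in range(1, len(obs)):
    forward[:, t] = (forward[:, t - 1] @ A) * E[:, obs[t]]

# Total likelihood P(X | lambda): sum the last column
# (probability of ending in A plus probability of ending in B).
likelihood = forward[:, -1].sum()
```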