Training a Hidden Markov Model using the Maximum Likelihood criterion has two cases:
- Observed data: When your training data is fully observed and there are no hidden variables, all you need to do is count frequencies and form the three matrices $\pi$, $A$, and $B$ (see the counting sketch after this list).
- Unobserved data: When there are hidden variables, some free parameters cannot be estimated by counting. In this case Expectation-Maximization (EM), known in the HMM setting as the Baum-Welch algorithm, is used. Each Baum-Welch training iteration performs two steps over all instances: the E-step computes the expected state occupancies and transitions under the current parameters (via the forward-backward algorithm), and the M-step re-estimates the parameters to maximize the expected complete-data log-likelihood (a sketch of one such iteration closes this answer).
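For the fully observed case, here is a minimal NumPy sketch of the counting estimator; the function name and integer coding are my own, for illustration only:

```python
import numpy as np

def count_mle(states, obs, n_states, n_symbols):
    # With fully labelled data, the ML estimates of pi, A and B are just
    # normalized frequency counts. `states` and `obs` are integer-coded
    # arrays for a single training sequence; with many sequences you would
    # accumulate counts over all of them (and pi would count the initial
    # state of each sequence). States that never occur would need smoothing.
    pi = np.zeros(n_states)
    A = np.zeros((n_states, n_states))
    B = np.zeros((n_states, n_symbols))
    pi[states[0]] += 1.0
    for s, o in zip(states, obs):
        B[s, o] += 1.0                        # emission counts
    for s_prev, s_next in zip(states[:-1], states[1:]):
        A[s_prev, s_next] += 1.0              # transition counts
    return (pi / pi.sum(),
            A / A.sum(axis=1, keepdims=True),
            B / B.sum(axis=1, keepdims=True))
```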
So, the answer to your question is: in every training iteration, all instances are involved. Read this great tutorial on HMMs; I'm sure it will help you.
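And here is a from-scratch sketch of one Baum-Welch iteration for a discrete-emission HMM; this is my own illustrative code, not taken from the tutorial, and `obs` is assumed to be an integer-coded NumPy array:

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    # E-step: scaled forward-backward pass (Rabiner-style scaling).
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N)); c = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta                      # P(state at t | all observations)
    xi = np.array([alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
                   for t in range(T - 1)])    # P(state pair at t, t+1 | all observations)
    return gamma, xi, np.log(c).sum()         # log-likelihood comes for free

def baum_welch_step(obs, pi, A, B):
    # One EM iteration: the E-step gives expected counts, the M-step normalizes them.
    gamma, xi, loglik = forward_backward(obs, pi, A, B)
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[obs == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new, loglik
```

Iterating `baum_welch_step` until the log-likelihood stops improving is the whole training loop.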
It will be helpful to distinguish the model from the inference you want to make with it, because standard terminology now mixes the two.
The model is the part where you specify the nature of: the hidden space (discrete or continuous), the hidden state dynamics (linear or non-linear), the nature of the observations (typically conditionally multinomial or Normal), and the measurement model connecting the hidden state to the observations. HMMs and state space models are two such sets of model specifications.
For any such model there are three standard tasks: filtering, smoothing, and prediction. Any time series text (or indeed Google) should give you an idea of what they are. Your question is about filtering, which is a way to get a) a posterior distribution over (or 'best' estimate of, for some sense of best, if you're not feeling Bayesian) the hidden state at $t$ given the complete set of data up to and including time $t$, and relatedly b) the probability of the data under the model.
In situations where the state is continuous, the state dynamics and measurement are linear, and all noise is Normal, a Kalman Filter will do that job efficiently. Its analogue when the state is discrete is the Forward Algorithm. In the case where there is non-Normality and/or non-linearity, we fall back on approximate filters. There are deterministic approximations, e.g. the Extended and Unscented Kalman Filters, and there are stochastic approximations, the best known of which is the Particle Filter.
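To make the linear-Gaussian case concrete, here is a minimal sketch of one Kalman filter predict/update step in NumPy; `m` and `P` are the state mean and covariance (following Särkkä's notation), and the function itself is my own illustration:

```python
import numpy as np

def kalman_step(m, P, y, F, Q, H, R):
    # Predict: push the estimate through the linear dynamics x' = F x + noise.
    m_pred = F @ m
    P_pred = F @ P @ F.T + Q
    # Update: correct with the new observation y = H x + noise.
    v = y - H @ m_pred                      # innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    m_new = m_pred + K @ v
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new                     # posterior mean and covariance at t
```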
The general feeling seems to be that in the presence of unavoidable non-linearity in the state or measurement parts, or non-Normality in the observations (the common problem situations), one tries to get away with the cheapest approximation possible: so EKF, then UKF, then PF.
The literature on the Unscented Kalman Filter usually includes comparisons of situations in which it might work better than the traditional linearization of the Extended Kalman Filter.
The Particle Filter has almost complete generality (any non-linearity, any distributions), but in my experience it requires quite careful tuning and is generally much more unwieldy than the others. In many situations, however, it's the only option.
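For concreteness, a minimal bootstrap particle filter step might look like the following sketch; `propagate` and `likelihood` are user-supplied model functions, and the names are my own:

```python
import numpy as np

def bootstrap_pf_step(particles, y, propagate, likelihood, rng):
    # Propagate each particle through the (possibly non-linear) state dynamics.
    particles = propagate(particles, rng)
    # Weight each particle by how well it explains the new observation.
    w = likelihood(y, particles)
    w = w / w.sum()
    # Multinomial resampling to fight weight degeneracy; the resampling scheme
    # and number of particles are exactly the tuning choices mentioned above.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]
```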
As for further reading: I like chs. 4-7 of Särkkä's Bayesian Filtering and Smoothing, though it's quite terse. The author has made an online copy available for personal use. Otherwise, most state space time series books will cover this material. For Particle Filtering, there's a Doucet et al. volume on the topic, but I guess it's quite old now. Perhaps others will point out a newer reference.
Best Answer
This one seems very good:
http://sourceforge.net/projects/cvhmm/
Some time ago I developed an HMM library, but it only handles discrete states, so no speech recognition. You can find it here. Adding the missing part should not be too hard, also because I used the Armadillo linear algebra library to translate from some Matlab code that can handle any kind of data.
I studied the theory and the code a bit from this very good C# library:
http://www.codeproject.com/Articles/541428/Sequence-Classifiers-in-Csharp-Part-I-Hidden-Marko
Hope it helps!