Solved – Understanding the hidden Markov model and how it is applied in speech recognition

gaussian mixture distribution, hidden markov model, speech recognition

I have for some time tried to understand how the hidden Markov model (HMM) works, and have found a lot of tutorials/papers on it which use the same examples/principles to explain the concept.

As far as I've understood, the "hidden" part of an HMM comes from not knowing which state you are in: the observation one is looking at only has a certain probability of occurring in a given state.

If I perceive the hidden Markov model as a function into which I feed my observation/observations, what is my actual output? A state? Probabilities of the different states?

And depending on the lexicon size, doing this would take quite some time?

And in the case of an ASR/speech recognition system, what is a state?
Is it each word, or is it a phoneme, or something completely different?

How do HMMs and GMMs work together in different ASR systems?

Best Answer

It is better to read Rabiner's tutorial; it provides enough information and is not very complex to comprehend.

If I perceive the hidden Markov model as a function into which I feed my observation/observations, what is my actual output? A state? Probabilities of the different states?

A hidden Markov model is not a function. It is a "model". It describes how to estimate the probability of an alignment between the observable sequence and the hidden sequence. You cannot just feed the observations in.
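To make "probability of an alignment" concrete, here is a minimal sketch with a toy two-state HMM. All the probabilities (initial, transition, emission) are made-up numbers for illustration, not trained values; the function computes the joint probability of one fixed alignment of hidden states to observations.

```python
# Toy HMM with two hidden states and two observation symbols.
# All probabilities below are made up for illustration.
pi = {"s1": 0.6, "s2": 0.4}                 # initial state probabilities
a = {"s1": {"s1": 0.7, "s2": 0.3},          # transition probabilities
     "s2": {"s1": 0.4, "s2": 0.6}}
b = {"s1": {"x": 0.9, "y": 0.1},            # emission probabilities
     "s2": {"x": 0.2, "y": 0.8}}

def joint_probability(states, observations):
    """P(states, observations) for one fixed alignment of
    hidden states to observations."""
    p = pi[states[0]] * b[states[0]][observations[0]]
    for t in range(1, len(observations)):
        p *= a[states[t - 1]][states[t]] * b[states[t]][observations[t]]
    return p

p = joint_probability(["s1", "s1", "s2"], ["x", "x", "y"])
```

Note that this scores one particular hidden sequence; decoding means searching over all such sequences for the best one.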

In speech recognition you find the most probable sequence of hidden states. To do that, you consider all possible hidden-state sequences and all possible alignments between hidden states and observable states, compute the probability of every alignment, and take the most probable alignment as the result.
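The search over alignments is done with dynamic programming rather than brute-force enumeration. Below is a sketch of the Viterbi algorithm on the same kind of toy two-state HMM (all probabilities are invented for illustration); real decoders work in log space and over much larger state graphs.

```python
# Toy two-state HMM; all probabilities are made up for illustration.
states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
a = {"s1": {"s1": 0.7, "s2": 0.3},
     "s2": {"s1": 0.4, "s2": 0.6}}
b = {"s1": {"x": 0.9, "y": 0.1},
     "s2": {"x": 0.2, "y": 0.8}}

def viterbi(observations):
    """Return the most probable hidden-state sequence for the
    observations, without enumerating every alignment explicitly."""
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: pi[s] * b[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] * a[r][s])
            best[t][s] = best[t - 1][prev] * a[prev][s] * b[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

path = viterbi(["x", "x", "y"])
```

The dynamic-programming table keeps only the best path into each state at each time, which is why the search stays tractable.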

And depending on the lexicon size, doing this would take quite some time?

There are quite efficient approaches (dynamic programming such as the Viterbi algorithm, usually combined with beam pruning), but it still takes a lot of time. That is not just due to the lexicon size, but also due to the fact that you have to consider many possible alignments.

And in the case of an ASR/speech recognition system what is a state?... Is it each word?, or is it a phoneme? or something completely different?

A state is a subphone. In the conventional scheme each phone is split into 3 states: the beginning of the phone, the middle of the phone, and the end of the phone. It is also possible to use 5 states per phone.
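A 3-state phone model conventionally uses a left-to-right topology: each state can stay where it is or move forward, never backward. Here is a small sketch; the transition probabilities are illustrative, not trained values.

```python
# Left-to-right 3-state topology for one phone: each state either
# stays put (self-loop) or moves forward, so the model is entered at
# "begin" and left through "end".
# Probabilities are illustrative, not trained values.
states = ["begin", "middle", "end"]
a = {
    "begin":  {"begin": 0.6, "middle": 0.4, "end": 0.0},
    "middle": {"begin": 0.0, "middle": 0.7, "end": 0.3},
    "end":    {"begin": 0.0, "middle": 0.0, "end": 1.0},  # exit handled by the decoder
}

# Every row sums to 1, and no backward transitions are allowed.
rows_ok = all(abs(sum(row.values()) - 1.0) < 1e-9 for row in a.values())
no_backward = a["middle"]["begin"] == 0.0 and a["end"]["middle"] == 0.0
```

The self-loops are what let one subphone state absorb a variable number of acoustic frames, which is how the model handles different speaking rates.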

how does hmm and gmm work together in different ASR systems?

The GMM computes the probability of every observation given a hidden state (the emission probability). The HMM, described above, computes the probability of a sequence of observations aligned to a sequence of hidden states.
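As a sketch of the GMM side: the emission likelihood of a state is a weighted sum of Gaussian densities. The example below is deliberately simplified to one-dimensional observations with made-up parameters; in a real ASR system the observation would be a vector of acoustic features (e.g. MFCCs) and each component a multivariate Gaussian.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_likelihood(x, weights, means, variances):
    """p(x | state): weighted sum of Gaussian components.
    The mixture weights must sum to 1."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Toy two-component mixture for one hidden state (made-up parameters).
p = gmm_likelihood(0.5, weights=[0.3, 0.7], means=[0.0, 1.0], variances=[1.0, 1.0])
```

During decoding, such a likelihood plays the role of the emission probability `b[s][o]` in the HMM, so the GMM and the HMM plug together directly.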
