What you can do, to fit it into the classical HMM framework, is define a new hidden state space equal to the product of your current two-state space and the output space. With five possible outputs, this results in a state space with $2 \times 5 = 10$ states. Let us define $S_t$ as the state at time $t$ and $O_t$ as the output at time $t$, which I'll assume can take on values $1, 2, \dots, 5$ for simplicity. $S_t = 1$ might correspond to your original state 1 and $O_{t-1} = 1$, $S_t = 2$ to original state 1 and $O_{t-1} = 2$, and so forth.
Transition probabilities: the transitions between states $S_t$ and $S_{t+1}$ are very limited once you've seen the output at time $t$, as only two states in the expanded state space are then possible to transition into. You can retain the simplicity of your current transition matrix by assuming that the transition probability from $S_t$ to $S_{t+1}$ depends solely on the part of $S_t$ that comes from your original definition of state. Under this assumption, which seems to be what you want, you still have only two transition probabilities to estimate, and consequently will not suffer from an explosion in the number of transition probabilities in your model.
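To make the bookkeeping concrete, here is a minimal sketch in Python. The names (`A`, `B`, `expand_index`) and the assumption that the output at time $t$ depends on the current original state and the previous output are mine, chosen to illustrate the construction rather than to match your exact setup:

```python
import numpy as np

n_states, n_outputs = 2, 5  # original states and possible outputs

def expand_index(i, o_prev):
    """Map (original state i, previous output o_prev) to an expanded-state index."""
    return i * n_outputs + o_prev

def expanded_transition(A, B):
    """Build the 10x10 transition matrix over the product state space.

    A: (2, 2)    original transition matrix, A[i, j] = P(j at t+1 | i at t)
    B: (2, 5, 5) B[i, o_prev, o] = P(output o | original state i, previous output o_prev)

    Expanded state (i, o) encodes: original state i at time t, output o at t-1.
    The factorization below is exactly the simplifying assumption in the text:
    the original-state part moves by A alone, and once the output at time t is
    observed, only the two expanded states consistent with it can carry mass.
    """
    T = np.zeros((n_states * n_outputs, n_states * n_outputs))
    for i in range(n_states):
        for o in range(n_outputs):
            for j in range(n_states):
                for o_new in range(n_outputs):
                    T[expand_index(i, o), expand_index(j, o_new)] = B[i, o, o_new] * A[i, j]
    return T
```

Each row of `T` sums to one whenever the rows of `A` and the last axis of `B` do, so the result is a valid transition matrix for a standard HMM.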
It will be helpful to distinguish the model from the inference you want to make with it, because standard terminology often mixes the two.
The model is the part where you specify: the nature of the hidden space (discrete or continuous), the hidden state dynamics (linear or non-linear), the nature of the observations (typically conditionally multinomial or Normal), and the measurement model connecting the hidden state to the observations. HMMs and state space models are two such sets of model specifications.
For any such model there are three standard inference tasks: filtering, smoothing, and prediction. Any time series text (or indeed Google) should give you an idea of what they are. Your question is about filtering, which is a way to get a) a posterior distribution over (or 'best' estimate of, for some sense of best, if you're not feeling Bayesian) the hidden state at time $t$ given the complete set of data up to and including time $t$, and relatedly b) the probability of the data under the model.
In situations where the state is continuous, the state dynamics and measurement model are linear, and all the noise is Normal, a Kalman Filter will do that job efficiently. Its analogue when the state is discrete is the Forward Algorithm. Where there is non-Normality and/or non-linearity, we fall back on approximate filters. There are deterministic approximations, e.g. the Extended or Unscented Kalman Filters, and there are stochastic approximations, the best known of which is the Particle Filter.
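For the discrete case, the Forward Algorithm is short enough to sketch. This is a generic normalized implementation, assuming integer-coded observations and the usual `pi`/`A`/`B` parameterization (names mine); it returns both the filtering distributions, task a), and the log-likelihood of the data, task b):

```python
import numpy as np

def forward_filter(pi, A, B, obs):
    """Forward algorithm for a discrete HMM (the discrete analogue of the Kalman filter).

    pi:  (K,)   initial state distribution
    A:   (K, K) transition matrix, A[i, j] = P(state j at t+1 | state i at t)
    B:   (K, M) emission matrix,   B[i, o] = P(observation o | state i)
    obs: sequence of integer observations
    """
    K = len(pi)
    alpha_hat = np.zeros((len(obs), K))
    log_lik = 0.0
    predict = pi
    for t, o in enumerate(obs):
        unnorm = predict * B[:, o]      # condition on the new observation
        norm = unnorm.sum()
        log_lik += np.log(norm)         # accumulates log P(o_t | o_1..t-1)
        alpha_hat[t] = unnorm / norm    # filtering distribution P(state_t | o_1..t)
        predict = alpha_hat[t] @ A      # one-step prediction for time t+1
    return alpha_hat, log_lik
```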
The general feeling seems to be that in the presence of unavoidable non-linearity in the state or measurement parts or non-Normality in the observations (the common problem situations), one tries to get away with the cheapest approximation possible. So, EKF then UKF then PF.
The literature on the Unscented Kalman Filter usually includes some comparisons of situations where it might work better than the traditional linearization of the Extended Kalman Filter.
The Particle Filter has almost complete generality (any non-linearity, any distributions), but in my experience it requires quite careful tuning and is generally much more unwieldy than the others. In many situations, however, it's the only option.
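For a flavour of the stochastic approach, here is a bare-bones bootstrap particle filter, with the state dynamics and measurement model passed in as functions. All names are mine, and this is very much a sketch: real applications typically add effective-sample-size checks and better resampling schemes, which is where the careful tuning comes in:

```python
import numpy as np

def bootstrap_particle_filter(y, n_particles, init_fn, transition_fn, log_lik_fn, rng=None):
    """Minimal bootstrap particle filter.

    y:             sequence of observations
    init_fn:       (rng, n) -> initial particles, shape (n, ...)
    transition_fn: (rng, particles) -> particles propagated one step
    log_lik_fn:    (y_t, particles) -> log p(y_t | particle), shape (n,)

    Returns the per-step particle means as a crude filtering estimate.
    """
    rng = rng or np.random.default_rng()
    particles = init_fn(rng, n_particles)
    means = []
    for y_t in y:
        particles = transition_fn(rng, particles)             # propagate through the dynamics
        log_w = log_lik_fn(y_t, particles)                    # weight by the measurement model
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resampling
        particles = particles[idx]
        means.append(particles.mean(axis=0))
    return np.asarray(means)
```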
As for further reading: I like ch. 4-7 of Särkkä's Bayesian Filtering and Smoothing, though it's quite terse. The author makes an online copy available for personal use. Otherwise, most state space time series books will cover this material. For Particle Filtering there's a Doucet et al. volume on the topic, but I guess it's quite old now; perhaps others will point out a newer reference.
It is better to read Rabiner's tutorial; it provides enough information and is not very complex to comprehend.
A hidden Markov model is not a function; it is a model. It describes how to estimate the probability of an alignment between the observable sequence and the hidden sequence. You cannot just feed the observations in.
In speech recognition you find the most probable sequence of hidden states. For that you consider all possible hidden state sequences and all possible alignments between hidden states and observable states, and for every alignment you compute its probability. Then you take the most probable alignment as the result.
There are quite efficient approaches (one is sketched below), but it still takes a lot of time, not just due to lexicon size but also because you have to consider many possible alignments.
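The standard efficient approach is the Viterbi algorithm, a dynamic program that finds the single best alignment without enumerating all of them. A minimal sketch, assuming the usual discrete-HMM parameterization (`pi`, `A`, `B` are my names; zero probabilities would need guarding before taking logs in practice):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable hidden state sequence for a discrete HMM.

    pi: (K,) initial distribution; A: (K, K) transitions; B: (K, M) emissions;
    obs: integer-coded observations.
    """
    K, T = len(pi), len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])  # best log-prob of paths ending in each state
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)    # extend every path by one transition
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # backtrack along the stored choices
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```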
A state is a subphone. In the conventional scheme each phone is split into three states: the beginning, the middle, and the end of the phone. It is also possible to use five states per phone.
A GMM computes the probability of a single observation under each hidden state it could be aligned to. The HMM, as described above, computes the probability of a whole sequence of observations aligned to a sequence of hidden states.
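To make the division of labour concrete: in a GMM-HMM each hidden state (subphone) owns a Gaussian mixture that scores a single acoustic frame, playing the role the discrete emission table plays in a textbook HMM. A minimal sketch for a diagonal-covariance mixture (the names and the diagonal-covariance assumption are mine):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one acoustic frame x under a diagonal-covariance GMM.

    x:         (D,)   one observation frame (e.g. MFCC features)
    weights:   (C,)   mixture weights, summing to 1
    means:     (C, D) per-component means
    variances: (C, D) per-component diagonal variances
    """
    # per-component log N(x; mean_c, diag(var_c))
    log_comp = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances).sum(axis=1)
    log_comp += np.log(weights)
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())  # log-sum-exp over components
```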