What you can do, to fit this into the classical HMM framework, is define a new hidden state space equal to the product of your current two-state space and the output space. With five possible outputs, this results in a state space with $2 \times 5 = 10$ states. Let us define $S_t$ as the state at time $t$ and $O_t$ as the output at time $t$, which I'll assume can take on the values $1, 2, \dots, 5$ for simplicity. $S_t = 1$ might correspond to your original state 1 and $O_{t-1} = 1$, $S_t = 2$ to original state 1 and $O_{t-1} = 2$, and so forth.
Transition probabilities - The transitions between states $S_t$ and $S_{t+1}$ are very limited: once you have seen the output at time $t$, there are only two states in the expanded space that $S_{t+1}$ can be. You can retain the simplicity of your current transition matrix by assuming that the transition probability from $S_t$ to $S_{t+1}$ depends solely on the part of $S_t$ that comes from your original definition of state. Under this assumption, which seems to be what you want, you only have two transition probabilities to estimate, so the number of parameters in your model does not explode.
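To make the construction concrete, here is a minimal NumPy sketch of building the expanded $10 \times 10$ transition matrix under the assumption above. The matrices `A` and `B` are hypothetical random stand-ins for your estimated parameters, and I'm assuming the output at time $t$ depends on the current original state and the previous output; adapt the indexing to your actual model.

```python
import numpy as np

n_states, n_outputs = 2, 5  # 2 original hidden states, 5 outputs -> 10 expanded states

# Hypothetical parameters (placeholders, not fitted values):
# A[i, j]    = P(original state j at t+1 | original state i at t)
# B[j, o, p] = P(output p at time t | original state j at t, previous output o)
rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(n_states), size=n_states)
B = rng.dirichlet(np.ones(n_outputs), size=(n_states, n_outputs))

# Expanded state k = s * n_outputs + o encodes (original state s, previous output o).
A_big = np.zeros((n_states * n_outputs, n_states * n_outputs))
for s in range(n_states):
    for o in range(n_outputs):
        for s2 in range(n_states):
            for o2 in range(n_outputs):
                # Transition depends only on the original-state part of S_t,
                # times the probability of emitting o2 from state s2 given
                # that the previous output was o.
                A_big[s * n_outputs + o, s2 * n_outputs + o2] = A[s, s2] * B[s2, o, o2]

# Each row of the expanded matrix is still a valid probability distribution.
assert np.allclose(A_big.sum(axis=1), 1.0)
```

Note that only the $2 \times 2$ matrix `A` holds free transition parameters; the rest of `A_big` is determined by the emission model, which is exactly why the expansion does not blow up the number of transition probabilities.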
According to this Wikipedia article, there are many inference benefits you gain from an HMM. The first is the ability to assign a probability to any observation sequence $\mathbf{Y} = (Y_1,\ldots, Y_N)$ by marginalizing over the set of all possible hidden state sequences $\mathbf{X} = (X_1,\ldots, X_N)$:
$P(\mathbf{Y}) = \sum_{\mathbf{X}} P(\mathbf{X}) P( \mathbf{Y} \vert \mathbf{X} )$
This way, you can assign probabilities to observation sequences even in an on-line manner as observations arrive (using the very efficient forward algorithm). An anomaly is an observation that is (relatively) highly unlikely according to $P(\mathbf{Y})$ (a threshold can be used to decide). Of course, the value of $P(\mathbf{Y})$ grows smaller and smaller as $N$ increases. Many methods can be used to renormalize $P(\mathbf{Y})$ to keep it within the representable range of floating-point data types and enable meaningful thresholding. For example, we might use the following as an anomaly measure:
$\mathbb{A}_{N} = \log P(Y_N \vert Y_1,\ldots, Y_{N-1}) = \log \frac{P(Y_1,\ldots, Y_N)}{P(Y_1,\ldots, Y_{N-1})} = \log P(Y_1,\ldots, Y_N) - \log P(Y_1,\ldots, Y_{N-1})$
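As a sketch, the scaled forward recursion and the anomaly measure above can be computed like this. The parameters `pi`, `A`, `B` are random placeholders rather than fitted values, and the threshold at the end is arbitrary:

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Scaled forward algorithm: returns log P(Y_1..Y_t) for every prefix t.

    pi[i]   = P(X_1 = i)           (initial distribution)
    A[i, j] = P(X_{t+1} = j | X_t = i)
    B[i, y] = P(Y_t = y | X_t = i)
    """
    logliks = []
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()          # scaling constant = P(Y_1)
    alpha /= c
    total = np.log(c)
    logliks.append(total)
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]
        c = alpha.sum()      # = P(Y_t | Y_1..Y_{t-1}), keeps alpha in range
        alpha /= c
        total += np.log(c)
        logliks.append(total)
    return np.array(logliks)

rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(2))
A = rng.dirichlet(np.ones(2), size=2)
B = rng.dirichlet(np.ones(5), size=2)

obs = rng.integers(0, 5, size=100)
ll = forward_loglik(pi, A, B, obs)

# Anomaly score A_N = log P(Y_N | Y_1..Y_{N-1}) = difference of consecutive log-likelihoods.
scores = np.diff(ll)
anomalies = np.where(scores < np.log(0.01))[0]  # hypothetical threshold
```

Because the recursion renormalizes `alpha` at every step and accumulates only log scaling constants, it never underflows, and the per-step score $\mathbb{A}_N$ falls out for free as the increment in log-likelihood.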
Markov Switching Models are the same thing as Regime Switching Models. A Hidden Markov Switching Model or a Hidden Regime Switching Model (both of which are commonly called a Hidden Markov Model) is different.
A Hidden Markov Model (HMM) is a doubly stochastic process. There is an underlying stochastic process that is not observable (hidden), the results of which can be observed (these results being the second stochastic process). The underlying stochastic process that is hidden is what makes this model different.
Consider this coin toss example starting on page 5: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1165342
If the man behind the curtain flips 1 coin and tells you the results, you have 2 states (state 1: heads, state 2: tails). By knowing the result of each coin flip, you know the state exactly, so nothing is hidden and you cannot construct an HMM.
If the man behind the curtain flips 2 coins and tells you the results, you still have 2 states (state 1: coin 1, state 2: coin 2), but they are not uniquely tied to heads or tails. This is a hidden process because you cannot tell which state (coin 1 or coin 2) produced the observation (heads or tails). Therefore, you can construct an HMM.
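A short simulation of the two-coin case makes the "hidden" part tangible. The biases and switching probability below are made up for illustration; the point is that the flip sequence alone does not reveal which coin produced each result:

```python
import random

random.seed(0)
p_heads = {1: 0.9, 2: 0.3}   # hypothetical biases for coin 1 and coin 2
p_switch = 0.4               # hypothetical probability of switching coins each flip

coin, flips, coins_used = 1, [], []
for _ in range(10):
    if random.random() < p_switch:
        coin = 3 - coin      # toggle between coin 1 and coin 2
    coins_used.append(coin)
    flips.append('H' if random.random() < p_heads[coin] else 'T')

print(''.join(flips))   # observable sequence of heads/tails
print(coins_used)       # hidden state sequence, not recoverable exactly from the flips
```

Given only `flips`, the best you can do is infer a probability for each coin at each step, which is precisely the inference problem an HMM is built to solve.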