The Bayesian Information Criterion (BIC) is given by:
\begin{equation}\label{eq_BIC_FINAL}
BIC = \log f\left( {\bf{x}}|\hat{{\bf{\theta}}}_i ; H_i\right) - \frac{1}{2} \log \left| I\left(\hat{{\bf{\theta}}}_i \right)\right| + \frac{n_i}{2} \log 2 \pi e \overset{i}{\rightarrow} \max,
\end{equation}
where $i=1,\cdots,M$ indexes the candidate model orders, $\left| \cdot \right|$ denotes the determinant, $I\left(\hat{{\bf{\theta}}}_i \right)$ is the Fisher information matrix for the parameter vector ${\bf{\theta}}_i$, and $n_i$ is the number of unknown deterministic parameters under the hypothesized model $H_i$.
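As a concrete sanity check (an assumed toy example, not from the original text), consider $N$ i.i.d. samples $x_k \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$, so that $n_i = 1$, $\hat{\mu}$ is the sample mean, and $I\left(\hat{\mu}\right) = N/\sigma^2$; the criterion then evaluates to
\begin{equation}
BIC = -\frac{1}{2\sigma^2}\sum_{k=1}^{N}\left(x_k-\hat{\mu}\right)^2 - \frac{N}{2}\log 2\pi\sigma^2 - \frac{1}{2}\log\frac{N}{\sigma^2} + \frac{1}{2}\log 2\pi e.
\end{equation}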
MDL is obtained directly from the BIC in the limit $N\to \infty$ under an i.i.d. assumption. For $N$ i.i.d. samples we can write $I\left(\hat{{\bf{\theta}}}_i \right) = N\, i\left(\hat{{\bf{\theta}}}_i \right)$, where $i\left(\hat{{\bf{\theta}}}_i \right)$ is the Fisher information matrix of a single sample, evaluated at $\hat{{\bf{\theta}}}_i$. Since $\left| N\, i\left(\hat{{\bf{\theta}}}_i \right)\right| = N^{n_i} \left| i\left(\hat{{\bf{\theta}}}_i \right)\right|$, inserting this into the BIC gives
\begin{equation}\label{eq_MDL1}
\log f\left( {\bf{x}} ; H_i\right) = \log f\left( {\bf{x}} |\hat{{\bf{\theta}}}_i ; H_i\right) - \frac{n_i}{2} \log N - \frac{1}{2} \log \left| i\left(\hat{{\bf{\theta}}}_i \right)\right| + \frac{n_i}{2} \log 2 \pi e,
\end{equation}
where $H_i$ denotes the hypothesis that the model order is $i$.
As $N\to\infty$, the last two terms stay bounded while $\frac{n_i}{2}\log N$ grows without bound, so only the first two terms remain relevant and we get
\begin{equation}
\log f\left( {\bf{x}} ; H_i\right) = \log f\left( {\bf{x}} |\hat{{\bf{\theta}}}_i ; H_i\right) - \frac{n_i}{2} \log N \overset{i}{\rightarrow} \max.
\end{equation}
In the literature the criterion is usually written with the opposite sign, so one minimizes the MDL:
\begin{equation}
MDL = -\log f\left( {\bf{x}} |\hat{{\bf{\theta}}}_i ; H_i\right) + \frac{n_i}{2} \log N \overset{i}{\rightarrow} \min.
\end{equation}
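To illustrate the criterion in practice, here is a minimal Python sketch (the synthetic data and candidate models are assumed for the example, not part of the derivation) that selects a polynomial order by minimizing $MDL = -\log f\left( {\bf{x}} |\hat{{\bf{\theta}}}_i ; H_i\right) + \frac{n_i}{2} \log N$ under Gaussian noise with known variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed example): cubic signal plus Gaussian noise.
N = 200
sigma2 = 0.25                      # known noise variance
t = np.linspace(-1.0, 1.0, N)
x = 1.0 - 2.0 * t + 0.5 * t**3 + rng.normal(0.0, np.sqrt(sigma2), N)

def mdl_for_order(i):
    """MDL = -log f(x | theta_hat; H_i) + (n_i / 2) * log N
    for a degree-(i-1) polynomial model with n_i = i coefficients."""
    coeffs = np.polyfit(t, x, deg=i - 1)   # ML estimate under Gaussian noise
    resid = x - np.polyval(coeffs, t)
    loglik = (-0.5 * N * np.log(2.0 * np.pi * sigma2)
              - np.sum(resid**2) / (2.0 * sigma2))
    return -loglik + 0.5 * i * np.log(N)

orders = range(1, 9)               # candidate model orders i = 1..8
scores = [mdl_for_order(i) for i in orders]
best = min(orders, key=lambda i: scores[i - 1])
print("selected order:", best)     # expected to recover 4 coefficients (cubic)
```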
Hence, if either of the assumptions made above (a large number of samples, and independence of the samples) does not hold, MDL will not necessarily give the same results as BIC.
The following is quoted from the Scholarpedia website:
State space model (SSM) refers to a class of probabilistic graphical model (Koller and Friedman, 2009) that describes the probabilistic dependence between the latent state variable and the observed measurement. The state or the measurement can be either continuous or discrete. The term “state space” originated in 1960s in the area of control engineering (Kalman, 1960). SSM provides a general framework for analyzing deterministic and stochastic dynamical systems that are measured or observed through a stochastic process. The SSM framework has been successfully applied in engineering, statistics, computer science and economics to solve a broad range of dynamical systems problems. Other terms used to describe SSMs are hidden Markov models (HMMs) (Rabiner, 1989) and latent process models. The most well studied SSM is the Kalman filter, which defines an optimal algorithm for inferring linear Gaussian systems.
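To make the quoted description concrete, the following is a minimal sketch (my own illustration, not from Scholarpedia) of a one-dimensional linear Gaussian SSM together with the corresponding Kalman filter recursion; all model parameters here are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed scalar linear Gaussian SSM:
#   state:       s[k] = a * s[k-1] + process noise,   variance q
#   measurement: x[k] = s[k]       + measurement noise, variance r
a, q, r, T = 0.95, 0.1, 1.0, 100
s = np.zeros(T)
x = np.zeros(T)
for k in range(1, T):
    s[k] = a * s[k - 1] + rng.normal(0.0, np.sqrt(q))
    x[k] = s[k] + rng.normal(0.0, np.sqrt(r))

# Kalman filter: recursively track the posterior mean m and variance p
# of the latent state given the measurements seen so far.
m, p = 0.0, 1.0
est = []
for k in range(T):
    # predict step
    m_pred = a * m
    p_pred = a * a * p + q
    # update step with measurement x[k]
    gain = p_pred / (p_pred + r)
    m = m_pred + gain * (x[k] - m_pred)
    p = (1.0 - gain) * p_pred
    est.append(m)

print("final state estimate:", est[-1], "true state:", s[-1])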
The following answer, on choosing the number of HMM states, is quoted from a Q&A forum:
I'm assuming here that your output variable is categorical, though that may not be the case. Typically, when I've seen HMMs used, the number of states is known in advance rather than selected through tuning; the states usually correspond to some well-understood variable that happens not to be observed. But that doesn't mean you can't experiment with it.
The danger in using BIC (and AIC), though, is that $k$, the number of free parameters in the model, increases quadratically with the number of states: the transition probability matrix alone has $P(P-1)$ free parameters (for $P$ states), on top of the output probabilities for each category of the output given each state. So if the AIC and BIC are being calculated properly, $k$ should be going up fast.
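This parameter count is easy to verify directly; a small sketch (variable names assumed) for a $P$-state HMM with $K$ output categories and the resulting BIC penalty:

```python
import numpy as np

def hmm_free_params(P, K, count_initial=False):
    """Free parameters of a discrete-output HMM with P states and K categories.
    Transition matrix: P rows on a (P-1)-simplex -> P*(P-1) parameters.
    Emission matrix:   P rows on a (K-1)-simplex -> P*(K-1) parameters.
    Optionally the (P-1) initial-state probabilities are counted as well."""
    k = P * (P - 1) + P * (K - 1)
    return k + (P - 1 if count_initial else 0)

def bic_penalty(P, K, N):
    """BIC penalty (k/2) * log N for N observations."""
    return 0.5 * hmm_free_params(P, K) * np.log(N)

# The penalty grows roughly quadratically in the number of states P:
for P in (2, 4, 8, 16):
    print(P, hmm_free_params(P, K=5), round(bic_penalty(P, K=5, N=1000), 1))
```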
If you have enough data, I would recommend a softer method of tuning the number of states, such as testing on a holdout sample. You might also just look at the log-likelihood and see visually at what point it plateaus. Also, if your dataset is large, keep in mind that the $\log N$ penalty will push the BIC toward a smaller model.
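One possible shape for that holdout procedure, as a sketch only: `fit_hmm` and `loglik` below are hypothetical hooks standing in for whatever HMM library is used (e.g. a Baum-Welch fit and a forward-algorithm score), not real API calls.

```python
def select_num_states(train, holdout, candidates, fit_hmm, loglik):
    """Pick the number of HMM states by held-out log-likelihood.
    fit_hmm(train, P) and loglik(model, data) are hypothetical helpers
    supplied by the caller for the HMM library in use."""
    scores = {}
    for P in candidates:
        model = fit_hmm(train, P)           # fit on training sequences
        scores[P] = loglik(model, holdout)  # score on held-out sequences
    # Taking the argmax is the simplest rule; inspecting where `scores`
    # plateaus, as suggested above, is often more robust in practice.
    return max(scores, key=scores.get), scores
```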