Solved – Comparison between MDL and BIC

bic, hidden markov model

I'm currently studying Hidden Markov Models. I have a set of observations from which I need to determine the optimal number of states. After finding the maximum-likelihood fit using Baum-Welch, I considered two model selection criteria for determining the optimal number of states: Minimum Description Length (MDL) and the Bayesian Information Criterion (BIC). However, MDL selects 2 states whereas BIC selects 4. Does this mean that MDL performs better than BIC?
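For reference, the selection procedure described here can be sketched as follows. This is only an illustrative sketch: it assumes Gaussian emissions with diagonal covariances, hmmlearn's GaussianHMM as the Baum-Welch implementation, and the corresponding parameter count, none of which is specified in the question.

```python
import numpy as np
from hmmlearn import hmm

def n_free_params(k, d):
    # Free parameters of a k-state Gaussian HMM with diagonal covariances (assumed):
    # (k-1) initial probabilities + k(k-1) transition probabilities + k*d means + k*d variances.
    return (k - 1) + k * (k - 1) + 2 * k * d

def select_n_states(X, max_states=6):
    N, d = X.shape
    criterion = {}
    for k in range(1, max_states + 1):
        model = hmm.GaussianHMM(n_components=k, covariance_type="diag",
                                n_iter=100, random_state=0)
        model.fit(X)                        # Baum-Welch (EM) maximum-likelihood fit
        loglik = model.score(X)             # total log-likelihood of the observations
        # Penalized log-likelihood of the MDL / asymptotic-BIC form (to be minimized).
        criterion[k] = -loglik + 0.5 * n_free_params(k, d) * np.log(N)
    return min(criterion, key=criterion.get), criterion
```

The loop simply refits the model for each candidate number of states and keeps the one with the smallest penalized negative log-likelihood.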

Best Answer

The Bayesian Information Criterion (BIC) is given as:

\begin{equation}\label{eq_BIC_FINAL} BIC = \log f\left( {\bf{x}}|\hat{{\bf{\theta}}}_i ; H_i\right) - \frac{1}{2} \log \left| I\left(\hat{{\bf{\theta}}}_i \right)\right| + \frac{n_i}{2} \log 2 \pi e \overset{i}{\rightarrow} max, \end{equation}

where $i=1,\cdots,M$ is the model order index, $\left| \cdot \right|$ is the determinant, $I\left(\hat{{\bf{\theta}}}_i \right)$ is the Fisher Information Matrix for parameter ${\bf{\theta}}_i$ and $n_i$ is the number of unknown deterministic parameters under each hypothesized model.

MDL is derived directly from the BIC in the limit $N\to \infty$, assuming i.i.d. samples. With $N$ i.i.d. samples we can write $I\left(\hat{\theta}_i \right) = N\, i\left(\hat{\theta}_i \right)$, where $i\left(\hat{\theta}_i \right)$ is the Fisher information matrix based on a single sample, evaluated at $\hat{\theta}_i$.
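Since $\hat{\theta}_i$ has $n_i$ components, the determinant of the Fisher information matrix factorizes as

\begin{equation} \log \left| I\left(\hat{\theta}_i \right)\right| = \log \left| N\, i\left(\hat{\theta}_i \right)\right| = n_i \log N + \log \left| i\left(\hat{\theta}_i \right)\right|. \end{equation}

Inserting this into the BIC we get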

\begin{equation}\label{eq_MDL1} \log f\left( {\bf{x}} ; H_i\right) = \log f\left( {\bf{x}} |\hat{\theta}_i ; H_i\right) - \frac{n_i}{2} \log N - \frac{1}{2} \log \left| i\left(\hat{\theta}_i \right)\right| + \frac{n_i}{2} \log 2 \pi e, \end{equation}

where $H_i$ is the hypothesized model order.

Now, as $N$ goes to infinity, the first two terms on the right-hand side grow without bound (the log-likelihood grows roughly linearly in $N$ and the penalty grows like $\log N$), while the last two terms stay bounded, so asymptotically only the first two terms matter and we get

\begin{equation} \log f\left( {\bf{x}} ; H_i\right) = \log f\left( {\bf{x}} |\hat{\theta}_i ; H_i\right) - \frac{n_i}{2} \log N \overset{i}{\rightarrow} max. \end{equation}

Usually in the literature the signs are flipped, so the MDL is stated as a quantity to be minimized:

\begin{equation} MDL = -\log f\left( {\bf{x}} |\hat{\theta}_i ; H_i\right) + \frac{n_i}{2} \log N \overset{i}{\rightarrow} min. \end{equation}
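For clarity, here is a small sketch of both criteria as written above. The function names are mine, and $\log\left|i\left(\hat{\theta}_i\right)\right|$ is taken as a precomputed input, since evaluating the Fisher information for an HMM is a separate problem not covered in this answer.

```python
import numpy as np

def mdl(loglik, n_params, N):
    # MDL of the last equation: -log f(x | theta_hat; H_i) + (n_i / 2) log N, minimized over i.
    return -loglik + 0.5 * n_params * np.log(N)

def bic_full(loglik, n_params, N, logdet_i):
    # Finite-sample BIC of the first equation, maximized over i, after substituting
    # log|I(theta_hat)| = n_i log N + log|i(theta_hat)|.
    return (loglik
            - 0.5 * (n_params * np.log(N) + logdet_i)
            + 0.5 * n_params * np.log(2 * np.pi * np.e))
```

For large $N$ the extra constant terms in bic_full become negligible next to the log-likelihood and the $\log N$ penalty, so the two criteria agree asymptotically; at moderate sample sizes the $\log\left|i\left(\hat{\theta}_i\right)\right|$ and $\log 2\pi e$ terms can shift which model order wins.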

So if either of the assumptions made above, namely a large number of samples and independence of the samples, does not hold, MDL will not give the same results as BIC. The fact that the two criteria select different numbers of states (2 versus 4) therefore does not, by itself, mean that one criterion performs better than the other.
