Solved – Selecting the number of mixtures / hidden states / latent variables

aic, gaussian mixture distribution, hidden markov model, latent-variable, likelihood

My question is regarding Gaussian mixture models, hidden Markov models (HMMs), or any other clustering or latent variable model for which we can devise a likelihood function.

Specifically, I train an HMM on a temporal dataset. Assuming I have no prior knowledge of how the dataset is distributed, I am trying to find the number of hidden states (i.e. mixture components, latent variables, etc.) that best describes my dataset.

To do this, I fit 10 models with an increasing number of states and calculate their log-likelihood (LL) and AIC scores to evaluate the model fit. These are plotted below:

[Figure: LL and AIC scores plotted against the number of hidden states]
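For reference, I compute these scores roughly as follows (a minimal sketch assuming the `hmmlearn` `GaussianHMM` API and a hand-rolled free-parameter count for the AIC; the data here is just a placeholder):

```python
# Sketch: fit HMMs with an increasing number of states and score them with
# log-likelihood and AIC. Assumes hmmlearn and a 1-D temporal dataset; the
# parameter count is an approximation for a Gaussian HMM with diagonal covariances.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 1))          # placeholder data: (n_samples, n_features)
d = X.shape[1]

def n_parameters(k, d):
    """Rough free-parameter count: transitions + initial probs + means + variances."""
    return k * (k - 1) + (k - 1) + k * d + k * d

results = []
for n_states in range(1, 11):
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                      n_iter=100, random_state=0).fit(X)
    ll = hmm.score(X)                   # total log-likelihood of the data
    aic = 2 * n_parameters(n_states, d) - 2 * ll
    results.append((n_states, ll, aic))

for n_states, ll, aic in results:
    print(f"states={n_states:2d}  LL={ll:10.2f}  AIC={aic:10.2f}")
```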

My first attempt was, as when choosing the number of clusters in a k-means setting, to use a visual elbow/knee method to spot the number of states at which the corresponding score gain (LL or AIC, or the distortion measure if I were running k-means) plateaus.

With LL, I can see that the score gain flattens at 8 states (although it increases again at 9 states), so I can nominate 8 as a candidate according to the elbow method. Looking at AIC, which in contrast to LL takes model complexity into account, I suspect one should go for the point where delta(AIC) turns negative rather than where it flattens (as in an elbow approach), because AIC already penalises the increase in the number of states.

1) Do you agree that an elbow/knee method is not suitable for AIC due to the above reason?

2) Looking at the figure posted, what number of states would you pick?

3) Expanding on 2, and assuming one can calculate a likelihood function, what principled methods would you recommend in general for finding the optimal number of mixtures/states/latent variables in such a setting?

Any references very welcome.

EDIT:
@Aleksandr Blekh has kindly provided a link to an amazing SO answer on how to choose the number of clusters for k-means, which I think covers most alternatives to likelihood-based scoring approaches. The link also covers a BIC scoring example – which I will look into.
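As a note to myself, the BIC-based selection that answer describes can be sketched with scikit-learn's `GaussianMixture`, which exposes `.bic()`/`.aic()` directly (the data and the range of component counts below are made up):

```python
# Sketch: pick the number of mixture components by minimising BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-3, 1, size=(200, 2)),
               rng.normal(3, 1, size=(200, 2))])   # placeholder 2-component data

bics = []
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gmm.bic(X))            # lower BIC is better

best_k = int(np.argmin(bics)) + 1
print("BIC-selected number of components:", best_k)
```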

Best Answer

For the case of Gaussian mixture models, you can easily extend the model and get a Dirichlet Process GMM (DP-GMM, or infinite GMM). It was first proposed in this paper by Rasmussen. I like this paper very much because it introduces the concept very nicely.

The idea is to go non-parametric and let the data decide the number of clusters it is most comfortable with.

Imagine you decide that you have $K$ components. If $z_n$ is the cluster assignment of point $n$, you can put a prior over the assignments through the mixing proportions $\boldsymbol{\pi} = [p_1, p_2, \ldots, p_K]$. If you put yet another prior over $\boldsymbol{\pi}$, you can avoid specifying it by hand. We use a Multinomial prior over the assignments and a Dirichlet prior over $\boldsymbol{\pi}$. You would have:

$$ z_n \mid \boldsymbol{\pi} \sim \text{Multinomial}(\boldsymbol{\pi})\\ \boldsymbol{\pi} \mid \alpha \sim \text{Dirichlet}(\alpha) $$

On the other hand, you also put a prior on the latent parameters of each component $i$, $\boldsymbol{\theta}_i$: $$ \boldsymbol{\theta}_i \sim G_0 $$

And you also have the likelihood, which gives the probability of the observed data point $x_n$ given that it belongs to component $i$: $$ p(x_n \mid \boldsymbol{\theta}_i) $$

Up to here, this is a GMM with a fixed number of components or clusters.
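As a concrete illustration of this finite generative process, here is a small sketch with 1-D components (taking $G_0$ to be a wide normal prior on the component means is my own illustrative assumption):

```python
# Generative-process sketch of the finite Bayesian GMM described above.
import numpy as np

rng = np.random.RandomState(0)
K, N, alpha = 3, 100, 1.0

pi = rng.dirichlet(alpha * np.ones(K))        # pi | alpha ~ Dirichlet(alpha)
theta = rng.normal(0, 3, size=K)              # theta_i ~ G_0 (component means)
z = rng.choice(K, size=N, p=pi)               # z_n | pi ~ Multinomial(pi)
x = rng.normal(theta[z], 1.0)                 # x_n | z_n, theta ~ N(theta_{z_n}, 1)

print("mixing weights:", np.round(pi, 3))
print("first 10 assignments:", z[:10])
```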

Now, the trick is to let $K$ go to infinity [1]. If you do so, you end up with a Dirichlet Process prior, and you will be able to infer the number of clusters, the parameters of every cluster, and the cluster assignments.
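In practice, a quick way to play with this idea is scikit-learn's `BayesianGaussianMixture` with a Dirichlet-process prior; note that it is a truncated variational approximation rather than the exact infinite model, and the data below is made up:

```python
# Sketch: DP-GMM via a truncated variational approximation in scikit-learn.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-4, 1, size=(150, 2)),
               rng.normal(0, 1, size=(150, 2)),
               rng.normal(4, 1, size=(150, 2))])

dpgmm = BayesianGaussianMixture(
    n_components=10,                                   # truncation level: an upper bound on K
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,                    # alpha in the notation above
    random_state=0,
).fit(X)

# Components with non-negligible weight are the clusters the data "asked for".
effective = int(np.sum(dpgmm.weights_ > 0.01))
print("Effective number of components:", effective)
```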

The idea of the Dirichlet Process as a basis for non-parametric models is quite popular. I think I've seen it applied to HMMs as well, but I'm not sure about that.

[1] I wrote the maths out explicitly here, but it's only a draft; I would like to add some images and examples.

PS: See also this popular post: http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric-bayes-and-the-dirichlet-process/