Solved – Implementing EM clustering for a mixture Markov model

expectation-maximization, markov-chain, mixture-distribution, model-based-clustering, r

I have a mixture Markov model (containing K clusters, or components) that I am trying to train, i.e., use to perform clustering over a set of varying-length sequences. Each component of the model is a first-order Markov chain. The model is defined by the following:

$$p(\mathbf{v}_i) = \sum_{k=1}^{K} p(c_k)\, p(\mathbf{v}_i \mid c_k; \theta_k) \tag{1}$$

where

$$p(\mathbf{v}_i \mid c_k; \theta_k) = p(v_{i,1}; \theta_k^{I}) \prod_{l=2}^{L} p(v_{i,l} \mid v_{i,l-1}; \theta_k^{T}) \tag{2}$$

is the model describing the behaviour of sequences in the k-th cluster.

Each k-th model component (cluster) has an initial probability distribution vector $\theta_k^{I}$ and a transition matrix $\theta_k^{T}$ – see equation 2. Each sequence seq_i in the set of sequences has a different length L (so the L in the 2nd equation would be different for each sequence in the set).

There are $n$ possible visible states in the model and $m$ sequences in each cluster, and the total number of sequences is $M$.

In this case, how would I define the E- and M-steps for my EM algorithm? My understanding is that I have to maximize the log-likelihood function of the model, which would be a sum of log-likelihoods across all vectors $\mathbf{v}_i$ (synonymous with sequence seq_i) – something like:

$$\log L(\Theta) = \sum_{i=1}^{M} \log \sum_{k=1}^{K} p(c_k)\, p(\mathbf{v}_i \mid c_k; \theta_k)$$

However, I am stuck: I do not know whether this is the right objective to maximize, or which parameters it should be maximized over. I have looked for literature, but it is very sparse, almost nonexistent.

Another question: is there an R library with functions that could be extended to fit such a model with the EM algorithm?

Best Answer

As you say, EM is an algorithm that aims at maximizing the log-likelihood of your model. The fact that the sequences have different lengths is not a problem. To see why, let us first assume that you have trained your model. What does this mean? It means that you have estimates for the transition probabilities, $\hat{p}(v_{i}|v_{i-1};\theta_{K})$, and for the priors, $\hat{p}(c_{K};\theta_{K})$. How do you perform classification in such a model? You take your sequence, for example $v_{2}, v_{7}, v_{12}$, and calculate

$$\mathbf{argmax}_{K} \hat{p}(c_{K};\theta_{K})p(v_{2};\theta_{K}) \hat{p}(v_{7}|v_{2};\theta_{K}) \hat{p}(v_{12}|v_{7};\theta_{K})$$
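For concreteness, here is a minimal sketch of this classification step in Python; the function names (`log_seq_likelihood`, `classify`) and the parameter layout are my own illustration, not taken from any particular library:

```python
import numpy as np

def log_seq_likelihood(seq, init_p, trans_p):
    """Log-likelihood of one sequence under a single Markov chain.

    seq     -- list of integer states, e.g. [2, 7, 12]
    init_p  -- length-N vector of initial-state probabilities
    trans_p -- N x N matrix, trans_p[a, b] = p(b | a)
    """
    ll = np.log(init_p[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        ll += np.log(trans_p[prev, cur])
    return ll

def classify(seq, priors, init_probs, trans):
    """Return argmax_K of log p(c_K) + log p(seq | c_K)."""
    scores = [np.log(priors[k])
              + log_seq_likelihood(seq, init_probs[k], trans[k])
              for k in range(len(priors))]
    return int(np.argmax(scores))
```

Note how a longer sequence simply contributes more factors to the product; nothing else changes, which is why the varying lengths are harmless.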

Hence, whether you have longer or shorter sequences does not change anything substantially: you need the same number of parameters for each cluster. Without further simplification, that would be the number of free parameters of the transition matrix, $N(N-1)$ (each of the $N$ rows must sum to one), plus $N-1$ for the initial distribution, plus 1 for the prior. For example, with $N=10$ visible states each cluster has $90 + 9 + 1 = 100$ free parameters.

The EM algorithm would be as follows: start with some values for the transition matrices (you could generate random matrices and normalize each row, so that each one is a valid transition matrix) and for the $\theta_{i}$'s.

  1. Expectation: estimate the priors. You may do this by calculating the maximum-likelihood estimator, i.e. the number of sequences belonging to each cluster divided by the total number of sequences. A sequence belongs to cluster $K$ if $K = \mathbf{argmax}_{j}\,p(v|c_{j}, \theta)$.
  2. Maximization: based on the resulting assignment of sequences to clusters, estimate the initial and transition probabilities. Here, some smoothing technique, like Laplace smoothing, might be necessary in order to ensure numerical stability.

You go on like this until convergence, that is, until the value of the likelihood no longer improves; a sketch of the full loop is given below.
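
To tie the two steps together, here is a minimal sketch of the whole procedure in Python (I cannot offer R, see below). Everything here, the function name `fit_mixture_markov`, the `alpha` pseudo-count, the stopping tolerance, is my own choice, and the E-step uses the hard assignment described above rather than soft responsibilities:

```python
import numpy as np

def fit_mixture_markov(seqs, K, n_states, n_iter=100, alpha=1.0, seed=0):
    """Hard ("classification") EM for a mixture of first-order Markov chains.

    seqs     -- list of integer sequences (states coded 0..n_states-1),
                possibly of different lengths
    K        -- number of clusters
    alpha    -- Laplace smoothing pseudo-count
    """
    rng = np.random.default_rng(seed)

    # Random but valid initialization: every probability row sums to one.
    priors = np.full(K, 1.0 / K)
    init_probs = rng.random((K, n_states))
    init_probs /= init_probs.sum(axis=1, keepdims=True)
    trans = rng.random((K, n_states, n_states))
    trans /= trans.sum(axis=2, keepdims=True)

    def seq_loglik(seq, k):
        """Log-likelihood of one sequence under cluster k's Markov chain."""
        ll = np.log(init_probs[k, seq[0]])
        for a, b in zip(seq[:-1], seq[1:]):
            ll += np.log(trans[k, a, b])
        return ll

    labels = np.zeros(len(seqs), dtype=int)
    prev_total = -np.inf
    for _ in range(n_iter):
        # Expectation (hard): assign every sequence to its best-scoring
        # cluster, then re-estimate the priors from the assignment counts.
        scores = np.array([[np.log(priors[k]) + seq_loglik(s, k)
                            for k in range(K)] for s in seqs])
        labels = scores.argmax(axis=1)
        total = scores.max(axis=1).sum()
        if total - prev_total < 1e-6:   # likelihood no longer improves
            break
        prev_total = total
        counts = np.bincount(labels, minlength=K)
        priors = (counts + alpha) / (len(seqs) + K * alpha)

        # Maximization: re-estimate initial and transition probabilities
        # from the assigned sequences, with Laplace smoothing for stability.
        for k in range(K):
            members = [s for s, lab in zip(seqs, labels) if lab == k]
            init_counts = np.full(n_states, alpha)
            trans_counts = np.full((n_states, n_states), alpha)
            for s in members:
                init_counts[s[0]] += 1
                for a, b in zip(s[:-1], s[1:]):
                    trans_counts[a, b] += 1
            init_probs[k] = init_counts / init_counts.sum()
            trans[k] = trans_counts / trans_counts.sum(axis=1, keepdims=True)

    return priors, init_probs, trans, labels
```

A call like `priors, init_probs, trans, labels = fit_mixture_markov(seqs, K=3, n_states=10)` would then cluster a list of integer-coded sequences into three components.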

Regretfully, I am not familiar with R, so I cannot point you to a specific package.