Solved – How to handle high-dimensional feature vectors in probabilistic graphical models

feature-selection, feature-engineering, hidden-markov-model, large-data, machine-learning

I was doing some NLP work that involves training a hidden Markov model and using the model to segment sentences. For every sentence, I translate the tokens into feature vectors. The features are hand-picked by me, and for now I can only think of 20. All of the features are binary, so an example feature vector in my case would look like:

[0, 0, 1, 1, 0…] 

When training the HMM (supervised learning with maximum likelihood estimation), I convert each binary feature vector to an integer and use the integers as the observations in the HMM. My question is: what if I have more feature dimensions, say 100? Then it might not be possible or efficient to map feature vectors to integers when training the HMM. And what if the features are not binary? What are some common solutions/best practices for these problems?
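For a sense of scale, here is a minimal sketch (Python, with a hypothetical encoder function) of the integer mapping described above; it makes the blow-up explicit, since the observation alphabet grows as $2^n$ in the number of binary features:

```python
def features_to_int(bits):
    """Encode a binary feature vector as a single integer observation."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

print(features_to_int([0, 0, 1, 1, 0]))  # -> 6
print(2 ** 20)   # 20 binary features -> 1,048,576 possible observations
print(2 ** 100)  # 100 features -> ~1.27e30 symbols, far beyond any training set
```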

P.S. I did some research and found that one solution is to use k-means to find clusters of feature vectors and treat each cluster as a single observation in the HMM (also known as vector quantization), but I don't think it is the best solution.
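For reference, a vector-quantization step along those lines might look like the sketch below (assuming scikit-learn is available; the feature matrix `X` is hypothetical). Each cluster index then serves as one discrete HMM observation:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stack of binary feature vectors, one row per token.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 100))

# Vector quantization: cluster the vectors and use the cluster index as the
# discrete observation, shrinking the alphabet from 2^100 to n_clusters symbols.
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(X)
observations = kmeans.predict(X)  # integer codes in [0, 32)
```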

Best Answer

A Hidden Markov Model is defined by two different probability distributions, namely

$$ \begin{align*} p(s_t \mid s_{t-1}),&\;\;\;\text{the transition probabilities, and}\\ p(x_t \mid s_t),&\;\;\;\text{the emission probabilities.} \end{align*} $$

In typical presentations of HMMs the emission probabilities are taken to be categorical, as this is the natural choice if each observation in the sequence is a word.

You can easily modify the emission distribution to deal with other types of observations. For example, if your observation $x_t$ is a continuous vector, then it may make sense to assume $p(x_t \mid s_t)$ is a multivariate normal distribution with a state-dependent mean vector and covariance matrix, i.e.,

$$ x_t \mid s_t \sim N(\mu_{s_t}, \Sigma_{s_t}). $$
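As a rough illustration rather than a full HMM implementation, the sketch below (Python with SciPy; the per-state parameters `mu` and `Sigma` are hypothetical) evaluates this Gaussian emission density for a given state:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters for a 2-state HMM with 3-dimensional observations.
mu    = [np.zeros(3), np.ones(3)]        # state-dependent mean vectors
Sigma = [np.eye(3),   0.5 * np.eye(3)]   # state-dependent covariance matrices

def emission_density(x, state):
    """p(x_t | s_t = state) under the Gaussian emission model."""
    return multivariate_normal.pdf(x, mean=mu[state], cov=Sigma[state])

x_t = np.array([0.1, -0.2, 0.3])
print(emission_density(x_t, 0), emission_density(x_t, 1))
```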

On the other hand, if $x_t$ is a binary vector, you may want to make a conditional independence assumption (as you would for naive Bayes) and assume

$$ p(x_t\mid s_t) = \prod_{i=1}^n p(x_{ti}\mid s_t) = \prod_{i=1}^n \left(\theta_{s_t,i}\,\mathbb{I}\left\{x_{ti} = 1\right\} + (1-\theta_{s_t,i})\,\mathbb{I}\left\{x_{ti} = 0\right\}\right). $$
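Since each factor collapses to $\theta_{s_t,i}^{x_{ti}}(1-\theta_{s_t,i})^{1-x_{ti}}$, this emission probability is cheap to evaluate. Below is a minimal NumPy sketch (the parameter table `theta` is hypothetical); for large $n$ you would work in log space to avoid underflow:

```python
import numpy as np

# Hypothetical Bernoulli parameters theta[s, i] = p(x_i = 1 | state s)
# for a 2-state model over 5 binary features.
theta = np.array([[0.9, 0.1, 0.8, 0.2, 0.5],
                  [0.3, 0.7, 0.4, 0.6, 0.5]])

def emission_prob(x, state):
    """p(x_t | s_t) = prod_i theta^{x_i} * (1 - theta)^{1 - x_i}."""
    t = theta[state]
    return np.prod(np.where(x == 1, t, 1.0 - t))

def log_emission_prob(x, state):
    """Log-space version; avoids underflow when the feature count is large."""
    t = theta[state]
    return np.sum(np.where(x == 1, np.log(t), np.log1p(-t)))

x_t = np.array([0, 0, 1, 1, 0])
print(emission_prob(x_t, 0), np.exp(log_emission_prob(x_t, 0)))
```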

These are just some common distributions you'll see people work with. You can really use anything you want, provided you can compute the values you need and learn the parameters.