Solved – Hidden state models vs. stateless models for time series regression

hidden markov model, modeling, prediction, regression, time series

This is quite a generic question: assume I want to build a model to predict the next observation based on the previous $N$ observations ($N$ can be a parameter to optimize experimentally). So we basically have a sliding window of input features for predicting the next observation.

I can use a Hidden Markov Model approach, i.e. Baum-Welch to estimate a model, then Viterbi to infer the current state based on the last $N$ observations, then predict the most likely next state from the current state, and finally predict the next observation using the most likely next state and the HMM parameters (or variants, such as finding the predictive distribution of the next observation).
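For concreteness, here is a minimal sketch of that pipeline, assuming the third-party hmmlearn package and a 1-D series; the data, the window length $N$ and the number of states below are placeholders, not part of the question:

```python
# Minimal sketch of the HMM pipeline described above, assuming hmmlearn.
# The series `obs`, the window length N and the number of states are placeholders.
import numpy as np
from hmmlearn.hmm import GaussianHMM

obs = np.random.randn(500, 1)                 # placeholder 1-D observation series
N = 50                                        # window length (to be tuned)

model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=100)
model.fit(obs)                                # Baum-Welch (EM) parameter estimation

states = model.predict(obs[-N:])              # Viterbi decoding of the last N observations
current_state = states[-1]

next_state = np.argmax(model.transmat_[current_state])  # most likely next state
next_obs = model.means_[next_state]                      # expected emission of that state
```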

Or I can use a much simpler approach, using a stateless model (which takes the previous $N$ observations as input), e.g. an SVM, linear regression, splines, regression trees, nearest neighbors, etc. Such models are based on minimizing some prediction error over the training set and are therefore, conceptually, much simpler than a hidden-state model.
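A corresponding sketch of the stateless alternative, assuming scikit-learn; again, the series, $N$ and the choice of linear regression are only placeholders:

```python
# Sketch of the stateless sliding-window approach: build (previous N
# observations, next observation) pairs and fit any off-the-shelf regressor.
import numpy as np
from sklearn.linear_model import LinearRegression

series = np.random.randn(500)                 # placeholder series
N = 10                                        # window length (to be tuned)

X = np.array([series[i:i + N] for i in range(len(series) - N)])
y = series[N:]

reg = LinearRegression().fit(X, y)            # minimizes squared prediction error
next_obs = reg.predict(series[-N:].reshape(1, -1))
```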

Can someone share their experience in dealing with such a modelling choice? What would speak in favour of the HMM, and what in favour of a regression approach? Intuitively, one should take the simplest model possible to avoid over-fitting; this speaks in favour of a stateless approach… We also have to consider that both approaches get the same input data for training (I think this implies that, unless we incorporate additional domain knowledge into the modelling of a hidden state model, e.g. by fixing certain states and transition probabilities, there is no reason why a hidden state model should perform better). In the end one can of course play with both approaches and see which performs better on a validation set, but some heuristics based on practical experience might also be helpful…

Note: for me it is important to predict only certain events; I prefer a model which predicts a few "interesting/rare" events well over a model which predicts "average/frequent" events well but the interesting ones poorly. Perhaps this has an implication for the modelling choice.
Thanks.

Best Answer

In short, I think they work in different learning paradigms.

State-space models (hidden state models) and the stateless models you mentioned discover the underlying relationship in your time series through different learning paradigms: (1) maximum-likelihood estimation, (2) Bayesian inference, (3) empirical risk minimization.

In a state-space model,

let $x_t$ be the hidden state and $y_t$ the observable, $t>0$ (assume there is no control input).

You assume the following relationships for the model:

$P(x_0)$ as a prior

$P(x_t | x_{t-1})$ for $t \geq 1$ as how the state changes (in an HMM, it is a transition matrix)

$P(y_t | x_t)$ for $t \geq 1$ as how you observe (in an HMM, it could be a normal distribution conditioned on $x_t$)

and $y_t$ only depends on $x_t$.
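A toy simulation of this generative structure may make the factorization concrete; the example below assumes a two-state HMM with Gaussian emissions, and all parameter values are made up:

```python
# Toy simulation of the generative assumptions above: x_t follows a Markov
# chain, and y_t is emitted based only on x_t. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
prior = np.array([0.5, 0.5])                  # P(x_0)
trans = np.array([[0.9, 0.1],                 # P(x_t | x_{t-1}): transition matrix
                  [0.2, 0.8]])
mu    = np.array([0.0, 3.0])                  # P(y_t | x_t): Normal(mu[x_t], sigma[x_t])
sigma = np.array([1.0, 0.5])

T = 100
x = [rng.choice(2, p=prior)]                  # x_0 ~ P(x_0)
y = []
for t in range(1, T + 1):
    x.append(rng.choice(2, p=trans[x[-1]]))        # x_t ~ P(x_t | x_{t-1})
    y.append(rng.normal(mu[x[-1]], sigma[x[-1]]))  # y_t ~ P(y_t | x_t)
```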

When you use Baum-Welch to estimate the parameters, you are in fact looking for a maximum-likelihood estimate of the HMM. If you use a Kalman filter, you are solving a special case of the Bayesian filtering problem (whose update step is in fact an application of Bayes' theorem):

Prediction step:

$\displaystyle P(x_t|y_{1:t-1}) = \int P(x_t|x_{t-1})P(x_{t-1}|y_{1:t-1}) \, dx_{t-1}$

Update step:

$\displaystyle P(x_t|y_{1:t}) = \frac{P(y_t|x_t)P(x_t|y_{1:t-1})}{\int P(y_t|x_t)P(x_t|y_{1:t-1}) \, dx_t}$

In the Kalman filter, we assume the noise is Gaussian and the relationships underlying $P(x_t|x_{t-1})$ and $P(y_t|x_t)$ are linear. Therefore you can represent $P(x_t|y_{1:t-1})$ and $P(x_t|y_{1:t})$ simply by the mean and variance of $x_t$ (mean and variance are sufficient for a normal distribution), and the algorithm works as matrix formulas.
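A bare-bones sketch of those matrix recursions; the model matrices $F$, $H$, $Q$, $R$ and the observations below are placeholders chosen only for illustration:

```python
# Bare-bones Kalman filter showing the prediction/update steps in matrix form.
# F, H, Q, R and the observations are placeholders for illustration only.
import numpy as np

F = np.array([[1.0]])        # linear state transition: x_t = F x_{t-1} + process noise
H = np.array([[1.0]])        # linear observation:      y_t = H x_t     + observation noise
Q = np.array([[0.01]])       # process-noise covariance
R = np.array([[1.0]])        # observation-noise covariance

m, P = np.zeros((1, 1)), np.eye(1)            # Gaussian belief: mean and covariance

for y in np.random.randn(100, 1, 1):          # placeholder observations
    # Prediction step: P(x_t | y_{1:t-1})
    m_pred = F @ m
    P_pred = F @ P @ F.T + Q
    # Update step: P(x_t | y_{1:t}) via Bayes' theorem
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    m = m_pred + K @ (y - H @ m_pred)
    P = (np.eye(1) - K @ H) @ P_pred
```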

On the other hand, the stateless models you mentioned, like SVMs, splines, regression trees, and nearest neighbors, try to discover the underlying relationship between $\{y_0,y_1,\dots,y_{t-1}\}$ and $y_t$ by empirical risk minimization.

For maximum-likelihood estimation, you need to parametrize the underlying probability distribution first (as in the HMM: you have the transition matrix, and the observable for state $j$ has parameters $(\mu_j,\sigma_j)$).

To apply Bayes' theorem, you first need a "correct" prior $P(A)$, in the sense that $P(A) \neq 0$: if $P(A)=0$, then any inference yields $0$, since $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.

For empirical risk minimization, universal consistency is guaranteed for any underlying probability distribution if the VC dimension of the learning rule does not grow too fast as the amount of available data $n \to \infty$.