Solved – Baum-Welch and hidden Markov models: Continuous observation densities in HMMs

baum-welch, hidden-markov-model, optimization

I am currently trying to understand how parameters are re-estimated for hidden Markov models (HMMs) using expectation-maximization (EM).

What I have trouble understanding is what the symbol emission probability actually models. In the discrete case it contains the probability of seeing each symbol in a given state; the continuous analogue is how probable it is to see a given continuous observation vector in a given state.

The Gaussian mixture model that models this probability distribution is defined, for each state $j$, by the parameters $c_{jk},\mu_{jk},\Sigma_{jk}$, where $c_{jk}$ is the weight of the $k$'th component PDF of the mixture, and $\mu_{jk}$ and $\Sigma_{jk}$ are that component's mean and covariance.

Re-estimation of these parameters is defined as follows:

\begin{equation}\tag{1}\label{1}
\widetilde{c}_{jk} = \frac{\sum_{t=1}^{T}\gamma_{jk}(t)}{\sum_{t=1}^{T}\sum_{k=1}^{M}\gamma_{jk}(t)}
\end{equation}

\begin{equation}\tag{2}\label{2}
\widetilde{\mu}_{jk} = \frac{\sum_{t=1}^{T}\gamma_{jk}(t) \cdot \boldsymbol{o_t}}{\sum_{t=1}^{T}\gamma_{jk}(t)}
\end{equation}

\begin{equation}\tag{3}\label{3}
\widetilde{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_{jk}(t) \cdot (\boldsymbol{o}_t - \boldsymbol{\mu}_{jk})(\boldsymbol{o}_t - \boldsymbol{\mu}_{jk})^T}{\sum_{t=1}^{T}\gamma_{jk}(t)}
\end{equation}

Here $\gamma_{jk}(t)$ is the probability of being in state $j$ at time $t$ with the $k$'th mixture component accounting for the observation.
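The three update formulas above can be sketched in code. This is a minimal illustration (not a full Baum-Welch implementation) that assumes the responsibilities $\gamma_{jk}(t)$ have already been computed in the E-step; all array names and shapes are my own choices.

```python
import numpy as np

def reestimate(gamma, obs):
    """M-step updates (1)-(3), given responsibilities and observations.

    gamma : (T, J, K) array, gamma[t, j, k] = gamma_{jk}(t)
    obs   : (T, D) array of observation vectors o_t
    """
    T, J, K = gamma.shape
    D = obs.shape[1]

    # (1): mixture weights -- expected time in (j, k) over expected time in j
    denom_c = gamma.sum(axis=(0, 2))                   # shape (J,)
    c = gamma.sum(axis=0) / denom_c[:, None]           # shape (J, K)

    # (2): means -- gamma-weighted average of the observations
    denom = gamma.sum(axis=0)                          # shape (J, K)
    mu = np.einsum('tjk,td->jkd', gamma, obs) / denom[:, :, None]

    # (3): covariances -- gamma-weighted outer products of the residuals
    sigma = np.empty((J, K, D, D))
    for j in range(J):
        for k in range(K):
            r = obs - mu[j, k]                         # residuals, (T, D)
            sigma[j, k] = (gamma[:, j, k, None] * r).T @ r / denom[j, k]
    return c, mu, sigma
```

Note that the weights $\widetilde{c}_{jk}$ come out normalized over $k$ by construction, since the denominator of (1) sums the numerator over all mixture components.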

Equation \eqref{1} makes sense…

Equation \eqref{1} is the re-estimation formula for $c_{jk}$: the ratio between the expected number of times the system is in state $j$ using the $k$'th mixture component, and the expected number of times the system is in state $j$ at all. That makes sense, and the formula looks like exactly that.

What I don't get are the other two equations. Why are they defined as they are? It is said that the observations weight each numerator term, but how does that pull the estimate closer to the observation mean?

Similarly with the covariance matrix…

And how and why is $\gamma_{jk}(t)$ itself defined the way it is?

It is stated on PDF page 351 that this is fairly straightforward.
I don't quite agree.

Best Answer

Not super familiar with these models in this form, but:

The $\widetilde{\mu}_{jk}$ and $\widetilde{\Sigma}_{jk}$ look to be a weighted "sample" mean and weighted "sample" covariance.

For $\widetilde{\mu}_{jk}$, it's a weighted average of the $\boldsymbol{o}_t$ terms, with each term weighted by $\frac{\gamma_{jk}(t)}{\sum_{t=1}^{T}\gamma_{jk}(t)}$.

For example, if the first 10 terms were somewhat likely to be in $j,k$, e.g. $\gamma_{jk}(1) \approx \gamma_{jk}(2) \approx ... \approx \gamma_{jk}(10) \approx 0.3$, but all terms beyond the 10th are very unlikely, $\gamma_{jk}(t) \approx 0.02$ for $t$ from 11 to 100, then our best estimate for $\mu_{jk}$ will be mainly based on those first 10 terms (something like $\frac{3}{4.8}$ weighting for the first 10).

For another example, if the only time we think state $j$ had any weight from the $k$'th component is $t = 1$, with say $\gamma_{jk}(1) = 0.1$, but at all other times $\gamma_{jk}(t) \approx 0$ for $t \geq 2$, then we'll have $\widetilde{\mu}_{jk} \approx \boldsymbol{o}_1$.
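The first example can be checked numerically. The sketch below uses the same weights as above ($\gamma \approx 0.3$ for the first 10 steps, $\approx 0.02$ for the remaining 90) with hypothetical 1-D observations, purely for illustration:

```python
import numpy as np

# gamma_{jk}(t): 0.3 for the first 10 time steps, 0.02 for the other 90
gamma = np.concatenate([np.full(10, 0.3), np.full(90, 0.02)])
# hypothetical observations: first 10 sit at 5.0, the rest at 50.0
obs = np.concatenate([np.full(10, 5.0), np.full(90, 50.0)])

weights = gamma / gamma.sum()   # normalized weights; sum to 1
mu = weights @ obs              # the weighted mean from equation (2)

# The first 10 points carry 3/4.8 = 62.5% of the total weight,
# so mu is pulled toward 5.0 even though 90 of 100 points sit at 50.0.
```

So the estimate is dominated by the observations the model believes actually came from component $(j,k)$.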

The $\widetilde{\Sigma}_{jk}$ update has the same idea, except that instead of plain $\boldsymbol{o}_t$, it's a weighted average of the outer products $(\boldsymbol{o}_t - \boldsymbol{\mu}_{jk})(\boldsymbol{o}_t - \boldsymbol{\mu}_{jk})^T$. If you're familiar with sample covariance matrices, this should look familiar: it's just the $\gamma$-weighted version of the usual outer-product estimate.

As for the $\gamma_{jk}(t)$ terms (I assume you mean from your linked doc): the left factor is the same as in the discrete case, namely the posterior probability of being in state $j$ at time $t$. The right factor is the density of the $k$'th mixture component at $\boldsymbol{o}_t$ relative to the sum of all the component densities in state $j$. That is, $Pr_t(j,k) = Pr_t(j)\, Pr_t(k \mid j)$, following the general formula $Pr(A, B) = Pr(A)\, Pr(B \mid A)$.
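This factorization can be sketched directly. In the snippet below, the state posterior `gamma_j_t` is taken as given (it would come from the forward-backward pass), and the function names and array layouts are illustrative assumptions, not the textbook's notation:

```python
import numpy as np

def gaussian_pdf(o, mu, sigma):
    """Density of a multivariate normal N(mu, sigma) at point o."""
    d = len(mu)
    diff = o - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def mixture_responsibility(gamma_j_t, o_t, c_j, mu_j, sigma_j):
    """gamma_{jk}(t) = gamma_j(t) * Pr(k | j, o_t) for k = 1..K.

    gamma_j_t : scalar, posterior prob. of state j at time t
    c_j, mu_j, sigma_j : the K weights, means, covariances of state j
    """
    # weighted density of each component at o_t
    dens = np.array([c_j[k] * gaussian_pdf(o_t, mu_j[k], sigma_j[k])
                     for k in range(len(c_j))])
    # split the state posterior across components, proportionally
    return gamma_j_t * dens / dens.sum()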

To try and give an overview: your $j,k$ probabilities are in one respect similar to the discrete case, in terms of $j$, and have a new extra part, mixtures of normals within a state.

If you know these probabilities, you can turn that around and estimate the parameters of the mixtures by computing weighted estimates. Is this what you were going for?
