Solved – Difference between MLE and Baum–Welch in HMM fitting

expectation-maximization, hidden-markov-model

In this popular question, the highly upvoted answer treats MLE and Baum–Welch as separate approaches to HMM fitting:

For the training problem we can use the following 3 algorithms: MLE (maximum likelihood estimation), Viterbi training (DO NOT confuse with Viterbi decoding), Baum–Welch = forward-backward algorithm

BUT in Wikipedia, it says

The Baum–Welch algorithm uses the well known EM algorithm to find the maximum likelihood estimate of the parameters

So, what's the relationship between MLE and Baum–Welch algorithm?


My attempt: the objective of the Baum–Welch algorithm is to maximize the likelihood, but it uses a specialized algorithm (EM) to solve that optimization problem. We could still maximize the likelihood with other methods, such as gradient descent. This is why the answer treats the two algorithms as separate.
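To write down what I mean (using $\mathbf{x}$ for the observed sequence, $\mathbf{z}$ for the hidden state path, and $\theta = (\pi, A, B)$ for the HMM parameters; this notation is mine, not from the linked answers), the objective in both cases is

$$\hat{\theta}_{\text{MLE}} \;=\; \arg\max_{\theta} \, \log p(\mathbf{x} \mid \theta) \;=\; \arg\max_{\theta} \, \log \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \theta),$$

and Baum–Welch/EM is just one iterative scheme for maximizing it.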

Am I right, and can anyone help me clarify?

Best Answer

Referring to one of the answers (by Masterfool) in the question you linked:

Morat's answer is false on one point: Baum-Welch is an Expectation-Maximization algorithm, used to train an HMM's parameters. It uses the forward-backward algorithm during each iteration. The forward-backward algorithm really is just a combination of the forward and backward algorithms: one forward pass, one backward pass.

And I agree with PierreE's answer here: the Baum–Welch algorithm is used to compute the maximum likelihood estimate for an HMM. If the states are known (supervised learning with labeled sequences), then a different method is used to maximize the likelihood, for example simply counting the frequencies of each emission and transition observed in the training data (see the slides provided by Franck Dernoncourt).
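To make the supervised case concrete, here is a minimal sketch of that count-based MLE for a discrete HMM. The state/symbol sequences, sizes, and variable names are toy values I made up for illustration, not taken from the linked slides.

```python
# Supervised (labelled-states) MLE for a discrete HMM: with the hidden states
# observed, the maximum likelihood estimates are just normalised counts.
import numpy as np

states = np.array([0, 0, 1, 1, 0, 1, 1, 0])   # known (labelled) state sequence
obs    = np.array([2, 0, 1, 1, 2, 1, 0, 2])   # observed symbols
n_states, n_symbols = 2, 3

# Transition MLE: count i -> j transitions, then normalise each row.
A = np.zeros((n_states, n_states))
for s, s_next in zip(states[:-1], states[1:]):
    A[s, s_next] += 1
A /= A.sum(axis=1, keepdims=True)

# Emission MLE: count symbol k emitted while in state i, then normalise each row.
B = np.zeros((n_states, n_symbols))
for s, o in zip(states, obs):
    B[s, o] += 1
B /= B.sum(axis=1, keepdims=True)

print(A)   # row i gives P(next state = j | current state = i)
print(B)   # row i gives P(symbol = k | state = i)
```

No EM is needed here because nothing is hidden; the sufficient statistics can be counted directly.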

In the setting of MLE for an HMM with hidden states, I don't think you can just use gradient descent: the likelihood (or log-likelihood) has no closed-form maximizer and has to be optimized iteratively, just as in the mixture-model case, which is why we turn to EM. (See Bishop, Pattern Recognition and Machine Learning, Section 13.2.1, p. 614, for more details.)
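For completeness, here is a minimal self-contained sketch of the kind of EM iteration Baum–Welch performs for a discrete-observation HMM, with Rabiner-style scaling in the forward-backward pass. The function name, model sizes, and toy sequence are assumptions made for this example, not code from any of the referenced sources.

```python
# A toy Baum-Welch (EM) sketch for a discrete-observation HMM using numpy.
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=50, seed=0):
    """Estimate HMM parameters (pi, A, B) by EM from one observation sequence."""
    rng = np.random.default_rng(seed)
    T = len(obs)
    # Random row-normalised initial guesses for the parameters.
    pi = rng.random(n_states); pi /= pi.sum()
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: scaled forward-backward pass (scaling avoids underflow).
        alpha = np.zeros((T, n_states)); beta = np.zeros((T, n_states))
        scale = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

        gamma = alpha * beta                       # P(state at t | data)
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = np.zeros((T - 1, n_states, n_states)) # P(state t, state t+1 | data)
        for t in range(T - 1):
            x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
            xi[t] = x / x.sum()

        # M-step: re-estimate the parameters from the expected counts.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

obs = np.array([0, 1, 0, 2, 2, 1, 0, 1, 2, 0])   # toy symbol sequence
pi, A, B = baum_welch(obs, n_states=2, n_symbols=3)
print(np.round(A, 3)); print(np.round(B, 3))
```

The E-step (forward-backward) computes expected state occupancies and transitions under the current parameters, and the M-step re-estimates the parameters from those expected counts, exactly the counting of the supervised case but with soft, probabilistic counts.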