Maximum Likelihood – Can Log-Likelihood Function Values Decrease After One EM Iteration?

dirichlet distribution, expectation-maximization, markov chain, maximum likelihood, mixture-distribution

I am applying a MAP log-likelihood approach in order to fit a Markov mixture model, where the objective function to be maximized is given by the formula:

$$
L(X|\Theta _K)=\sum_{i=1}^{n}f(X_i|\Theta_K)+\sum_{j=1}^{K}\sum_{n=0}^{M}\log p(\theta_n^{j}|a_n^{j})
$$

where the second term is the sum of log Dirichlet priors (I am using the density formula given by Wikipedia) and the first term is the sum of the log-likelihood across all sequences and all components.
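
For tracking the algorithm it helps to have this objective as a standalone function that can be re-evaluated after every E- and M-step. Below is a minimal sketch of what that could look like in R; `loglik_seq()` and `log_dir_prior()` are hypothetical stand-ins for your own likelihood and prior code, and `Theta`/`A` are assumed to be lists (one element per component) of parameter and hyper-parameter vectors.

    ## Sketch only: evaluate the MAP objective L(X | Theta_K)
    map_objective <- function(X, Theta, A) {
      ## first term: log-likelihood summed over all sequences
      loglik <- sum(sapply(X, function(x) loglik_seq(x, Theta)))
      ## second term: log Dirichlet prior of every parameter vector
      ## (transition-matrix rows, start probabilities) of every component
      logprior <- sum(mapply(log_dir_prior,
                             unlist(Theta, recursive = FALSE),
                             unlist(A, recursive = FALSE)))
      loglik + logprior
    }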

At this point I have achieved a lot in implementing the algorithm in R, thanks to the answers to my previously posted questions on this topic.

At this stage my question is: after performing one step of the expectation-maximization algorithm (with 2 components to start with), my value of $L(X|\Theta_K)$ became much smaller than it was (from -2200 to -8000). I believe that my code is correct and do not understand why this could be happening (the next 2 steps show a steady increase). Can the algorithm fluctuate like this in the beginning?

There are two possible issues, but I cannot pinpoint whether either of them is really the cause: an underflow problem (some of the values in my transition matrix, and the resulting likelihood values, can be so negligibly small that in certain calculations R, in which I am working, rounds them to zero, and thus I end up with incorrect results); or meaningless Dirichlet priors (e.g. sometimes the prior densities are larger than 1).
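
On the underflow point: with per-sequence likelihoods as small as the ones shown further down (values around 1e-290), any calculation that multiplies raw likelihoods or divides one tiny value by another can silently underflow to 0 and produce 0/0 = NaN or a wrongly normalized posterior. A common remedy is to keep everything in log space and normalize with the log-sum-exp trick; a minimal sketch (function name and arguments are my own, not from the question):

    ## log_w: log mixture weights, log_f: per-component log-likelihoods
    ## of one sequence; returns the posterior component probabilities
    posterior_logspace <- function(log_w, log_f) {
      a <- log_w + log_f                      # unnormalized log posteriors
      m <- max(a)                             # log-sum-exp stabilization
      exp(a - (m + log(sum(exp(a - m)))))
    }

    ## example: the naive ratio exp(-2900) / (exp(-2200) + exp(-2900)) is NaN
    ## because both terms underflow to 0, but in log space the result is
    ## essentially c(1, 0) to machine precision
    posterior_logspace(log(c(0.5, 0.5)), c(-2200, -2900))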

Additionally: the multinomial distributions resulting from the M-step (each row of the transition matrix and the start-probability vector) sum to 1 for each of the components. The posterior conditional probabilities of the hidden variables (components) also seem to make sense; a quick way to check these is sketched below.
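
These checks can be automated; a sketch, assuming `trans` is one component's estimated transition matrix, `start` its start-probability vector, and `post` the n-by-K matrix of posterior component probabilities (all names are mine, not from the question):

    stopifnot(all(abs(rowSums(trans) - 1) < 1e-12))  # each row is a distribution
    stopifnot(abs(sum(start) - 1) < 1e-12)           # start probabilities sum to 1
    stopifnot(all(abs(rowSums(post) - 1) < 1e-12))   # posteriors sum to 1 per sequence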

An example of the conditional posteriors of the hidden variables for the first ([[1]]) and second ([[2]]) component, for each of the 50 sequences, is given below (results of the first iteration's E-step):

    [[1]]
     [1] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.520433e-10 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
    [11] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
    [21] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
    [31] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.726330e-02 1.000000e+00 1.000000e+00
    [41] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.979822e-01 1.000000e+00 1.000000e+00

    [[2]]
     [1] 7.324479e-65  2.462187e-97  4.146568e-32  3.498317e-97  2.135013e-274 1.000000e+00  7.731884e-47  2.068553e-264 8.497501e-260
    [10] 4.356689e-271 2.983088e-213 7.485688e-110 4.556750e-287 1.360219e-173 2.340220e-45  2.609916e-59  6.057344e-59  1.286382e-185
    [19] 3.879706e-80  8.488843e-188 1.881308e-14  3.098226e-290 1.681928e-290 1.168018e-211 2.123491e-292 5.767748e-177 9.232827e-198
    [28] 1.120970e-159 1.397181e-257 4.078388e-48  1.524531e-247 1.000000e+00  1.833904e-252 1.452165e-263 1.878481e-111 3.379251e-178
    [37] 1.823290e-247 9.727367e-01  8.553065e-238 2.748773e-32  4.602824e-138 1.212533e-93  1.744806e-271 2.677587e-292 7.822883e-131
    [46] 2.504779e-111 1.775144e-16  8.020178e-01  1.301381e-63  4.437099e-114

Best Answer

In applications of the expectation-maximization algorithm, the likelihood of the data (or the MAP objective, when priors are included) should never decrease from one iteration to the next.

As discussed by R. Neal and G. Hinton, the EM algorithm can be seen as (effectively) a gradient ascent algorithm in which the data likelihood, expressed as a function of the model parameters, is the objective function.

The main case I can envision in which this type of problem would arise (aside from implementation bugs) is when an approximation technique is used to solve for the parameter values in the expectation and/or maximization step. For example, if the structure of the distributions in question forces one to approximate the parameter values that maximize the likelihood, then those approximations may allow one to "jump across" a local maximum, just as in any other application of gradient-following algorithms.

For the specific case of applying the EM algorithm to estimate a two-component discrete Markov model, you should be able to evaluate the expectations and do the maximizations exactly, so you shouldn't be having this problem.
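
In that situation a decrease in the objective almost always points to an implementation bug or to numerical underflow, so it is worth asserting monotonicity explicitly while debugging. A sketch of such a check in R, with `em_step()` and `map_objective()` as hypothetical stand-ins for the poster's own E/M update and objective function:

    run_em <- function(X, Theta, A, iters = 50, tol = 1e-8) {
      obj <- map_objective(X, Theta, A)
      for (it in seq_len(iters)) {
        Theta   <- em_step(X, Theta, A)          # one full E-step + M-step
        obj_new <- map_objective(X, Theta, A)
        if (obj_new < obj - tol)                 # should never happen for exact EM
          warning(sprintf("objective decreased at iteration %d: %.2f -> %.2f",
                          it, obj, obj_new))
        obj <- obj_new
      }
      Theta
    }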
