Maximum Likelihood – Can Log-Likelihood Function Values Decrease After One EM Iteration?

dirichlet distribution, expectation-maximization, markov chain, maximum likelihood, mixture-distribution

I am applying a MAP log-likelihood approach in order to fit a Markov mixture model, where the objective function to be maximized is given by the formula:

$$
L(X|\Theta _K)=\sum_{i=1}^{n}f(X_i|\Theta_K)+\sum_{j=1}^{K}\sum_{n=0}^{M}\log p(\theta_n^{j}|a_n^{j})
$$

where the second term is the sum of log Dirichlet priors (I am using the density formula given by Wikipedia) and the first term is the sum of the log-likelihood across all sequences and all components.
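
For tracking the algorithm it helps to have this objective as a standalone function that can be re-evaluated after every E- and M-step. Below is a minimal sketch of what that could look like in R; `loglik_seq()` and `log_dir_prior()` are hypothetical stand-ins for your own likelihood and prior code, and `Theta`/`A` are assumed to be lists (one element per component) of parameter and hyper-parameter vectors.

    ## Sketch only: evaluate the MAP objective L(X | Theta_K)
    map_objective <- function(X, Theta, A) {
      ## first term: log-likelihood summed over all sequences
      loglik <- sum(sapply(X, function(x) loglik_seq(x, Theta)))
      ## second term: log Dirichlet prior of every parameter vector
      ## (transition-matrix rows, start probabilities) of every component
      logprior <- sum(mapply(log_dir_prior,
                             unlist(Theta, recursive = FALSE),
                             unlist(A, recursive = FALSE)))
      loglik + logprior
    }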

At this point I have achieved a lot in implementing the algorithm in R, thanks to the answers to my previously posted questions on this topic.

At this stage my question is: after performing one step of the expectation-maximization algorithm (with 2 components to start with), my value of $L(X|\Theta_K)$ became much smaller than it was (from -2200 to -8000). I believe that my code is correct and do not understand why this could be happening (the next 2 steps show a steady increase). Can the algorithm fluctuate like this in the beginning?

There are two possible issues, but I cannot pinpoint whether either of them is really the cause: an underflow problem (some of the values in my transition matrix, and the resulting likelihood values, can be so negligibly small that in certain calculations R, in which I am working, rounds them to zero, and thus I end up with incorrect results); or meaningless Dirichlet priors (e.g. sometimes the prior densities are larger than 1).
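
On the underflow point: with per-sequence likelihoods as small as the ones shown further down (values around 1e-290), any calculation that multiplies raw likelihoods or divides one tiny value by another can silently underflow to 0 and produce 0/0 = NaN or a wrongly normalized posterior. A common remedy is to keep everything in log space and normalize with the log-sum-exp trick; a minimal sketch (function name and arguments are my own, not from the question):

    ## log_w: log mixture weights, log_f: per-component log-likelihoods
    ## of one sequence; returns the posterior component probabilities
    posterior_logspace <- function(log_w, log_f) {
      a <- log_w + log_f                      # unnormalized log posteriors
      m <- max(a)                             # log-sum-exp stabilization
      exp(a - (m + log(sum(exp(a - m)))))
    }

    ## example: the naive ratio exp(-2900) / (exp(-2200) + exp(-2900)) is NaN
    ## because both terms underflow to 0, but in log space the result is
    ## essentially c(1, 0) to machine precision
    posterior_logspace(log(c(0.5, 0.5)), c(-2200, -2900))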

Additionally: the multinomial distributions resulting from the M-step (each row of the transition matrix and the start-probability vector) sum to 1 for each of the components. The posterior conditional probabilities of the hidden variables (components) also seem to make sense; a quick way to check these is sketched below.
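
These checks can be automated; a sketch, assuming `trans` is one component's estimated transition matrix, `start` its start-probability vector, and `post` the n-by-K matrix of posterior component probabilities (all names are mine, not from the question):

    stopifnot(all(abs(rowSums(trans) - 1) < 1e-12))  # each row is a distribution
    stopifnot(abs(sum(start) - 1) < 1e-12)           # start probabilities sum to 1
    stopifnot(all(abs(rowSums(post) - 1) < 1e-12))   # posteriors sum to 1 per sequence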

An example of the conditional posteriors of the hidden variables for the first ([[1]]) and second ([[2]]) component, for each of the 50 sequences, is given below (results of the first iteration's E-step):

    [[1]]
     [1] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.520433e-10 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
    [11] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
    [21] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
    [31] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.726330e-02 1.000000e+00 1.000000e+00
    [41] 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.979822e-01 1.000000e+00 1.000000e+00

    [[2]]
     [1] 7.324479e-65  2.462187e-97  4.146568e-32  3.498317e-97  2.135013e-274 1.000000e+00  7.731884e-47  2.068553e-264 8.497501e-260
    [10] 4.356689e-271 2.983088e-213 7.485688e-110 4.556750e-287 1.360219e-173 2.340220e-45  2.609916e-59  6.057344e-59  1.286382e-185
    [19] 3.879706e-80  8.488843e-188 1.881308e-14  3.098226e-290 1.681928e-290 1.168018e-211 2.123491e-292 5.767748e-177 9.232827e-198
    [28] 1.120970e-159 1.397181e-257 4.078388e-48  1.524531e-247 1.000000e+00  1.833904e-252 1.452165e-263 1.878481e-111 3.379251e-178
    [37] 1.823290e-247 9.727367e-01  8.553065e-238 2.748773e-32  4.602824e-138 1.212533e-93  1.744806e-271 2.677587e-292 7.822883e-131
    [46] 2.504779e-111 1.775144e-16  8.020178e-01  1.301381e-63  4.437099e-114

Best Answer

In applications of the expectation-maximization algorithm, the likelihood of the data (or the MAP objective, when priors are included) should never decrease from one iteration to the next.

As discussed by R. Neal and G. Hinton, the EM algorithm can be seen as (effectively) a gradient ascent algorithm in which the data likelihood, expressed as a function of the model parameters, is the objective function.

The main case I can envision in which this type of problem would arise (aside from implementation bugs) is when an approximation technique is used to solve for the parameter values in the expectation and/or maximization step. For example, if the structure of the distributions in question forces one to approximate the parameter values that maximize the likelihood, then those approximations may allow one to "jump across" a local maximum, just as in any other application of gradient-following algorithms.

For the specific case of applying the EM algorithm to estimate a two-component discrete Markov model, you should be able to evaluate the expectations and do the maximizations exactly, so you shouldn't be having this problem.
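
In that situation a decrease in the objective almost always points to an implementation bug or to numerical underflow, so it is worth asserting monotonicity explicitly while debugging. A sketch of such a check in R, with `em_step()` and `map_objective()` as hypothetical stand-ins for the poster's own E/M update and objective function:

    run_em <- function(X, Theta, A, iters = 50, tol = 1e-8) {
      obj <- map_objective(X, Theta, A)
      for (it in seq_len(iters)) {
        Theta   <- em_step(X, Theta, A)          # one full E-step + M-step
        obj_new <- map_objective(X, Theta, A)
        if (obj_new < obj - tol)                 # should never happen for exact EM
          warning(sprintf("objective decreased at iteration %d: %.2f -> %.2f",
                          it, obj, obj_new))
        obj <- obj_new
      }
      Theta
    }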
