Solved – Why isn’t the EM algorithm increasing the log-likelihood after each iteration

expectation-maximization

I used the EM algorithm to estimate the parameters of the following time-invariant process.
$$y=Ct+e_y$$
$$x=Pt+e_x$$

where $y \in R$, $x \in R^d$, $t\sim N(0,I^{p \times p})$, $e_y \sim N(0,\Sigma_y)$, and $e_x \sim N(0,\Sigma_x)$. Both $\Sigma_x$ and $\Sigma_y$ are diagonal covariance matrices. The set of parameters to be estimated is $\theta=(C,P,\Sigma_x, \Sigma_y)$. I first derived the joint probability density function $p(x,y,t)$ as follows.
$$p(x,y,t)=N(y;Ct,\Sigma_y)\,N(x;Pt,\Sigma_x)\,N(t;0,I)$$
so that, up to additive constants,
$$-2\ln p(x,y,t)=x^T\Sigma_x^{-1}x+y^T\Sigma_y^{-1}y-2t^T(C^T\Sigma_y^{-1}y+P^T\Sigma_x^{-1}x)+t^T(C^T\Sigma_y^{-1}C+P^T\Sigma_x^{-1}P+I)t$$
Knowing that $p(t|x,y)\propto p(x,y,t)$, completing the square in $t$ gives
$$Var[t|x,y]=(C^T\Sigma_y^{-1}C+P^T\Sigma_x^{-1}P+I)^{-1}$$
$$E[t|x,y]=(C^T\Sigma_y^{-1}C+P^T\Sigma_x^{-1}P+I)^{-1}(C^T\Sigma_y^{-1}y+P^T\Sigma_x^{-1}x)$$
$$Var[t|x,y]=E[(t-E[t|x,y])(t-E[t|x,y])^T]=E[tt^T|x,y]-E[t|x,y]E[t^T|x,y]$$
$$E[tt^T|x,y]=Var[t|x,y]+E[t|x,y]E[t^T|x,y]$$
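To make the E-step concrete, here is a minimal MATLAB sketch that computes these posterior moments for a single observation pair; `posterior_t` is a hypothetical helper name (not part of the code below), and the current parameters are assumed to be passed in as `C`, `P`, `Sx`, `Sy`.

    function [Et, Vt, Ett] = posterior_t(x, y, C, P, Sx, Sy)
    % Posterior moments of the latent t for one observation pair.
    % x: d-by-1, y: dout-by-1, C: dout-by-p, P: d-by-p,
    % Sx, Sy: diagonal noise covariances (d-by-d and dout-by-dout).
    p   = size(P, 2);
    M   = eye(p) + C'*(Sy\C) + P'*(Sx\P);   % posterior precision C'Sy^-1 C + P'Sx^-1 P + I
    Vt  = inv(M);                           % Var[t | x, y]
    Et  = M \ (C'*(Sy\y) + P'*(Sx\x));      % E[t | x, y]
    Ett = Vt + Et*Et';                      % E[t t' | x, y]
    end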
Now I have the posterior moments required for the M-step, namely $E[t]$ and $E[tt^T]$ (for conciseness the conditioning on $x,y$ is omitted). In the M-step, since $y$ and $x$ are conditionally independent given $t$, the pairs $(P,\Sigma_x)$ and $(C,\Sigma_y)$ can be maximized independently of each other. The expected complete-data log-likelihood of the EM algorithm is derived as follows (keeping only the terms that depend on the parameters).
$$L(X=[x_n]_{n=1}^N;\theta)=\dfrac{N}{2}\ln|\Sigma_x^{-1}|-\dfrac{1}{2}trace[\Sigma_x^{-1} \sum_{n=1}^N(x_nx_n^T-x_nE[t_n^T]P^T-PE[t_n]x_n^T+PE[t_nt_n^T]P^T)]$$

$$L(Y=[y_n]_{n=1}^N;\theta)=\dfrac{N}{2}\ln|\Sigma_y^{-1}|-\dfrac{1}{2}trace[\Sigma_y^{-1} \sum_{n=1}^N(y_ny_n^T-y_nE[t_n^T]C^T-CE[t_n]y_n^T+CE[t_nt_n^T]C^T)]$$

Subsequently, taking the derivatives of the above equations with respect to each parameter and setting them to zero yields (the derivation is similar to equations 13 to 16 of http://mlg.eng.cam.ac.uk/zoubin/course04/tr-96-2.pdf):
$$C_{new}=(\sum_{n=1}^Ny_nE[t_n^T])(\sum_{n=1}^NE[t_nt_n^T])^{-1}$$
$$P_{new}=(\sum_{n=1}^Nx_nE[t_n^T])(\sum_{n=1}^NE[t_nt_n^T])^{-1}$$
$$\Sigma_x=\dfrac{1}{N}\sum_{n=1}^N(x_nx_n^T-P_{new}E[t_n]x_n^T)$$
$$\Sigma_y=\dfrac{1}{N}\sum_{n=1}^N(y_ny_n^T-C_{new}E[t_n]y_n^T)$$
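For illustration, here is a minimal sketch of these M-step updates as a hypothetical MATLAB helper (`m_step` is my own name, not from the code below), assuming the E-step statistics have already been collected.

    function [C, P, Sx, Sy] = m_step(X, Y, Et, Ett_sum)
    % M-step updates from the E-step statistics, following the equations above.
    % X: N-by-d, Y: N-by-dout, Et: p-by-N with Et(:,n) = E[t_n],
    % Ett_sum: p-by-p, equal to sum_n E[t_n t_n'].
    N   = size(X, 1);
    EtX = Et * X;                          % sum_n E[t_n] x_n'   (p-by-d)
    EtY = Et * Y;                          % sum_n E[t_n] y_n'   (p-by-dout)
    C   = EtY' / Ett_sum;                  % C_new = (sum_n y_n E[t_n']) (sum_n E[t_n t_n'])^-1
    P   = EtX' / Ett_sum;                  % P_new, analogously
    Sx  = diag(diag(X'*X - P*EtX)) / N;    % keep only the diagonal: Sigma_x is assumed diagonal
    Sy  = diag(diag(Y'*Y - C*EtY)) / N;
    end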
This completes one EM iteration. The joint log-likelihood of the data is computed as follows, under the assumption that $p(x)$ and $p(y)$ are independent of each other (I am not sure this is correct). This log-likelihood should increase after each EM iteration.
$$L(X,Y;\theta_{new})=N\ln(2\pi)^{-1/2}+N\ln(2\pi)^{-d/2}+N\ln|\Sigma_y+C_{new}C^T_{new}|^{-1/2}+N\ln|\Sigma_x+P_{new}P^T_{new}|^{-1/2}-\dfrac{1}{2}\sum_{n=1}^N\left[x_n^T(\Sigma_x+P_{new}P^T_{new})^{-1}x_n+y_n^T(\Sigma_y+C_{new}C^T_{new})^{-1}y_n\right]$$
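A minimal sketch of how this marginal log-likelihood could be evaluated in MATLAB, under the same (questioned) independence assumption; `marginal_loglik` is a hypothetical helper name.

    function lik = marginal_loglik(X, Y, C, P, Sx, Sy)
    % Joint log-likelihood under the (questioned) assumption that the marginals
    % factorize, with x_n ~ N(0, Sx + P*P') and y_n ~ N(0, Sy + C*C').
    [N, d] = size(X);
    dout   = size(Y, 2);
    Cx = Sx + P*P';
    Cy = Sy + C*C';
    lik = -0.5*N*(d*log(2*pi)    + log(det(Cx))) ...
          -0.5*N*(dout*log(2*pi) + log(det(Cy))) ...
          -0.5*sum(sum((X/Cx).*X)) ...                 % sum_n x_n' inv(Cx) x_n
          -0.5*sum(sum((Y/Cy).*Y));                    % sum_n y_n' inv(Cy) y_n
    end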

I have implemented this in MATLAB, but the log-likelihood decreases after each step. I am attaching the code and the test data with this question. Could anybody help me identify the problem?

function net=PPLS_EM(X,Y,p,iter,tol)

net=struct('type','PPLS_EM','C',[],'P',[],'Ex',[],'Ey',[],'mux',[],'muy',[],'LL',[]);
[N,din]=size(X); %N by d input data matrix column [6 9 11 13 16 18] of the d00_te. data
[~,dout]=size(Y); %N by 1 output data matrix column 7 of the d00_te.dat

const1=(2*pi)^(-din/2);
const2=(2*pi)^(-dout/2);
lik=0;
LL=[];
tiny=exp(-700);

% set default arguments before p is used below
if nargin<5
tol=0.0001;
end

if nargin<4
iter=100;
end

if nargin<3
p=2;
end

Vcur=zeros(p,p,N);
tcur=zeros(p,N);

YY=diag(sum(Y.*Y)');
XX=diag(sum(X.*X)');


%% Initialization
C=ones(dout,p);
Ey=ones(dout,1);
Ey=diag(Ey);

P=ones(din,p);
Ex=ones(din,1);
Ex=diag(Ex);

Ex=Ex+(Ex==0)*tiny;
Ey=Ey+(Ey==0)*tiny;

Etxsum=0;
Etysum=0;
Vttsum=0;

for t=1:iter
%% E-step

oldlik=lik;
for n=1:N
    tcur(:,n)=inv(eye(p)+P'*inv(Ex)*P+C'*inv(Ey)*C)*(P'*inv(Ex)*X(n,:)'+C'*inv(Ey)*Y(n,:)');
    Vcur(:,:,n)=inv(eye(p)+P'*inv(Ex)*P+C'*inv(Ey)*C)+tcur(:,n)*tcur(:,n)';

    invx=inv(Ex+P*P');
    invy=inv(Ey+C*C');

    Etx=tcur(:,n)*X(n,:);
    Ety=tcur(:,n)*Y(n,:);

    Etxsum=Etxsum+Etx;
    Etysum=Etysum+Ety;

    Vttsum=Vttsum+Vcur(:,:,n);

    % calculate likelihood
    detiEx=sqrt(det(invx));
    detiEy=sqrt(det(invy));
    if (isreal(detiEx) && detiEx>0)
        lik=lik+log(detiEx)-0.5*sum(sum(X(n,:).*(X(n,:)*invx)))+log(detiEy)-0.5*sum(sum(Y(n,:).*(Y(n,:)*invy))); %log-likelihood of input data matrix
    else
        break;
    end;
end
lik=lik+N*log(const1)+N*log(const2);

if (t<=2)
    likbase=lik;
elseif (lik<oldlik)
    %fprintf('Oops!');
elseif ((lik-likbase)<(1 + tol)*(oldlik-likbase)||~isfinite(lik))
    break;
end;

LL=[LL lik];

%% M-step
C=Etysum'*inv(Vttsum);
P=Etxsum'*inv(Vttsum);
Ex=diag(diag(XX-(P*Etxsum)))/N;
Ey=diag(diag(YY-(C*Etysum)))/N;
end

net.C=C;
net.P=P;
net.Ex=Ex;
net.Ey=Ey;
net.LL=LL;

Data can be accessed at https://www.dropbox.com/s/i36kw4qq6al0pmn/d00_te.dat?dl=0

Results

$C_{new}$ for $p=2$ for 10 steps
val(:,:,1) =

-6.3224    1.7853

val(:,:,2) =

-6.3503    1.7866

val(:,:,3) =

-2.9558    1.9637

val(:,:,4) =

-1.6677    1.0381

val(:,:,5) =

-1.6664    1.0375

val(:,:,6) =

-1.6651    1.0368

val(:,:,7) =

-1.6637    1.0362

val(:,:,8) =

-1.6623    1.0355

val(:,:,9) =

-1.6609    1.0349

val(:,:,10) =

-1.6595    1.0342

$P_{new}$ for 10 steps

val(:,:,1) =

 0.0097   -0.0013
 0.0030    0.0199
 0.1186    0.0240
-6.5193    1.8837
-5.6963    1.8107
 0.2769   -0.0450

val(:,:,2) =

 0.0098   -0.0013
 0.0030    0.0198
 0.1193    0.0237
-6.5530    1.8852
-5.6773    1.8080
 0.2731   -0.0448

val(:,:,3) =

 0.0052    0.0015
-0.0059    0.0052
 0.0591   -0.0072
-3.0491    2.0738
-2.6955    1.7138
 0.1382   -0.0148

val(:,:,4) =

 0.0042    0.0028
-0.0050    0.0045
 0.0338    0.0123
-1.7145    1.1179
-1.5561    0.8878
 0.0927    0.0243

val(:,:,5) =

 0.0042    0.0028
-0.0050    0.0045
 0.0338    0.0123
-1.7131    1.1172
-1.5549    0.8872
 0.0927    0.0244

val(:,:,6) =

 0.0042    0.0028
-0.0050    0.0045
 0.0338    0.0123
-1.7116    1.1165
-1.5537    0.8867
 0.0927    0.0244

val(:,:,7) =

 0.0042    0.0028
-0.0050    0.0045
 0.0338    0.0123
-1.7101    1.1158
-1.5526    0.8861
 0.0927    0.0244

val(:,:,8) =

 0.0042    0.0027
-0.0050    0.0045
 0.0338    0.0123
-1.7086    1.1151
-1.5514    0.8855
 0.0926    0.0244

val(:,:,9) =

 0.0043    0.0027
-0.0049    0.0045
 0.0337    0.0124
-1.7071    1.1144
-1.5501    0.8849
 0.0926    0.0244

val(:,:,10) =

 0.0043    0.0027
-0.0049    0.0045
 0.0337    0.0124
-1.7055    1.1136
-1.5489    0.8843
 0.0926    0.0244

Log likelihood for 10 steps
-6176.77412636697 + 0.00000000000000i
-12350.0028284986 + 0.00000000000000i
-48330.2798829560 + 0.00000000000000i
-54546.8694822146 + 1507.96447372310i
-60722.1364253500 + 1507.96447372310i
-66897.4033684854 + 1507.96447372310i
-73072.6703116208 + 1507.96447372310i
-79247.9372547562 + 1507.96447372310i
-85423.2041978916 + 1507.96447372310i
-91598.4711410270 + 1507.96447372310i

Best Answer

Problem Solved

The decreasing likelihood was due to a coding problem: $lik$ was accumulated not only in the loop over the data samples but also in the loop over the EM steps. This adds the $lik$ of the current EM step onto the $lik$ of the previous EM step, which of course keeps decreasing $lik$. A toy illustration of the effect is sketched below, followed by the revised code, for anyone interested. No restriction on how you might use it.
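The following toy script (dummy numbers only, not the model) shows why accumulating $lik$ across EM iterations makes the reported log-likelihood keep decreasing even when the per-iteration value is unchanged.

    % Dummy numbers only: the same per-sample contribution every iteration.
    per_sample = -1.0;  N = 100;  iter = 5;

    lik = 0;  LL_buggy = [];
    for t = 1:iter
        for n = 1:N, lik = lik + per_sample; end   % buggy: lik carries over from the previous iteration
        LL_buggy = [LL_buggy lik];
    end

    LL_fixed = [];
    for t = 1:iter
        lik = 0;                                   % fixed: reset at the start of every EM iteration
        for n = 1:N, lik = lik + per_sample; end
        LL_fixed = [LL_fixed lik];
    end

    disp(LL_buggy)   % -100  -200  -300  -400  -500   (keeps decreasing)
    disp(LL_fixed)   % -100  -100  -100  -100  -100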

function net=PPLS_EM(X,Y,p,iter,tol)

net=struct('type','PPLS_EM','C',[],'P',[],'Ex',[],'Ey',[],'LL',[]);
[N,din]=size(X); %N by din input data matrix
[~,dout]=size(Y); %N by dout output data matrix

LL=[];
lik=0;

YY=diag(sum(Y.*Y)');
XX=diag(sum(X.*X)');

if nargin<5
tol=0.0001;
end

if nargin<4
iter=100;
end

if nargin<3
p=2;
end


%% Initialization

C=ones(dout,p);
Ey=ones(dout,1);
Ey=diag(Ey);

P=ones(din,p);
Ex=ones(din,1);
Ex=diag(Ex);

for t=1:iter
%% E-step
oldlik=lik;
[lik,Vtt,Etxsum,Etysum]=PPLS_E(C,P,Ex,Ey,X,Y);
if (t<=2)
    likbase=lik;
elseif (lik<oldlik)
    %fprintf('Oops!');
elseif ((lik-likbase)<(1 + tol)*(oldlik-likbase)||~isfinite(lik))
    break;
end;

LL=[LL lik];

%% M-step
C=Etysum'*inv(Vtt);
P=Etxsum'*inv(Vtt);
Ex=diag(diag(XX-(P*Etxsum)))/N;
Ey=diag(diag(YY-(C*Etysum)))/N;
end

net.C=C;
net.P=P;
net.Ex=Ex;
net.Ey=Ey;
net.LL=LL;

And the code for PPLS_E

    function [lik,Vtt,Etxsum,Etysum]=PPLS_E(C,P,Ex,Ey,X,Y)

    [N,din]=size(X);
    [~,dout]=size(Y);

    const1=(2*pi)^(-din/2);
    const2=(2*pi)^(-dout/2);

    lik=0;
    p=size(P,2);

    Vcur=zeros(p,p,N);
    tcur=zeros(p,N);

    tiny=exp(-700);
    I=eye(p);

    Ex=Ex+(Ex==0)*tiny;
    Ey=Ey+(Ey==0)*tiny;

    Etxsum=zeros(p,din);
    Etysum=zeros(p,dout);
    Vtt=0;

    Sigma_XY=I+C'*inv(Ey)*C+P'*inv(Ex)*P;
    for n=1:N
        tcur(:,n)=inv(Sigma_XY)*(P'*inv(Ex)*X(n,:)'+C'*inv(Ey)*Y(n,:)');
        Vcur(:,:,n)=inv(Sigma_XY)+tcur(:,n)*tcur(:,n)';

        Etx=tcur(:,n)*X(n,:);
        Ety=tcur(:,n)*Y(n,:);

        Etxsum=Etxsum+Etx;
        Etysum=Etysum+Ety;

        Vtt=Vtt+Vcur(:,:,n);

        % calculate likelihood
        Ydiff=Y(n,:)'-C*tcur(:,n);
        Xdiff=X(n,:)'-P*tcur(:,n);
        inx=inv(Ex+P*Vcur(:,:,n)*P');
        iny=inv(Ey+C*Vcur(:,:,n)*C');
        detiEx=sqrt(det(inx));
        detiEy=sqrt(det(iny));
        if (isreal(detiEx) && detiEx>0 && isreal(detiEy) && detiEy>0)
            lik=lik+log(detiEx)-0.5*sum(Xdiff'.*(Xdiff'*inx))+log(detiEy)-0.5*sum(Ydiff'.*(Ydiff'*iny)); %log-likelihood of input data matrix
        else
            break;
        end;
    end
    lik=lik+N*log(const1)+N*log(const2);