Not quite what you're asking for, but the Statistics Toolbox has a handy little function called glmfit
I'm attaching some code that I've used to demo Poisson regression in the past...
%% Generalized Linear Model Demo
% Generalized Linear Models encompass techniques like Poisson
% regression and logistic regression. These techniques are used to
% model counts and to estimate odds.
% Formally, the technique was developed to model data where the error terms
% don't exhibit constant variance.
% Specify the average relationship between X and Y
clear all
clc
% Create a set of 100 X values between 0 and 15
X = linspace(0,15,100);
X = X';
% Define a function that describes the average relationship between X and
% Y.
mu = @(x) exp(-1.5 +.25*x);
Y = mu(X);
plot(X,Y);
xlabel('Time');
ylabel('Count');
%% Use the Y = f(X) to generate a dataset with known characteristics
% At each "X", the observed Y is represented as a draw from a Poisson
% distribution with mean (lambda) = Y
Noisy_Y = poissrnd(Y,100,1);
% Create figure
figure1 = figure;
% Create axes
axes1 = axes('Parent',figure1,'YTick',[0:max(Noisy_Y)], 'YGrid','on');
hold(axes1,'all');
% Create scatter
scatter(X,Noisy_Y, 'b');
xlabel('Time')
ylabel('Count')
% Note the following:
% The mean of a Poisson process is also the variance of a Poisson process
% (a quick empirical check of this appears at the end of this cell).
% The blue curve doesn't exhibit constant variance (this violates one of
% the key assumptions underlying Ordinary Least Squares).
% The output from a Poisson process is always a non-negative integer. If
% lambda is close to zero and observations can never go negative, the
% shape of the distribution is bounded in one direction.
% As lambda gets larger, this boundary has less and less impact on the
% distribution of the errors.
hold on
plot(X,Y);
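% As promised above, a quick empirical check of the mean = variance
% property of the Poisson distribution (this check is an added
% illustration, not part of the fitting steps below):
check_sample = poissrnd(5, 10000, 1);
fprintf('mean = %.3f, variance = %.3f\n', mean(check_sample), var(check_sample));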
%% Create a fit using nonlinear regression
% The model used for the nonlinear regression is the same as "mu"
foo = fit(X, Noisy_Y, 'exp(B1 + B2*x)');
plot(foo)
xlabel('Time')
ylabel('Count')
SSE_OLS = sum((foo(X) - Noisy_Y).^2)
% Create textbox
annotation(figure1,'textbox',...
[0.147443514644351 0.802739726027397 0.255903765690377 0.0931506849315069],...
'String',{['SSE OLS = ' num2str(SSE_OLS)]},...
'FitBoxToText','off',...
'BackgroundColor',[1 1 1]);
%% Create a fit using a Generalized Linear Model
[b, dev, stats] = glmfit(X, Noisy_Y, 'poisson');
Predicted_Y = exp(b(1) + b(2)*X);
plot(X, Predicted_Y, 'k');
SSE_GLM = sum(stats.resid.^2)
% Create textbox
annotation(figure1,'textbox',...
[0.147443514644351 0.802739726027397 0.255903765690377 0.0931506849315069],...
'String',{['SSE OLS = ' num2str(SSE_OLS)], ['SSE GLM = ' num2str(SSE_GLM)]},...
'FitBoxToText','off',...
'BackgroundColor',[1 1 1]);
% The two curves (should) look almost the same
% The SSE is (essentially) the same.
% What gives? Why are we bothering with this GLM stuff?
%% Perform a Monte Carlo simulation
% Generate 1000 data sets that are consistent with the assumed
% relationship between Count and Time
% Preallocate the sliced output variables used inside the parfor loop
GLM_Prediction = zeros(length(X), 1000);
OLS_Prediction = zeros(length(X), 1000);
parfor i = 1:1000
Noisy_Y = poissrnd(Y, 100,1);
b = glmfit(X,Noisy_Y, 'poisson');
GLM_Prediction(:,i) = exp(b(1) + b(2)*X);
% Provide optimal starting values (to make sure that any differences
% are related to the algorithms rather than to convergence)
foo = fit(X, Noisy_Y, 'exp(B1 + B2*x)', 'StartPoint', [-1.5, .25]);
B = coeffvalues(foo);
OLS_Prediction(:,i) = exp(B(1) + B(2)*X);
end
figure
True = plot(X, Y, 'b');
hold on
GLM_Mean = plot(X, mean(GLM_Prediction, 2), 'k');
OLS_Mean = plot(X, mean(OLS_Prediction, 2), 'r');
xlabel('Time')
ylabel('Count')
% The means of the two techniques are, for all intents and purposes,
% identical.
%%
% The standard deviation from the GLM is MUCH smaller, even though
% I provided the nonlinear regression with optimal starting values.
% Formally, the advantage of using a GLM has to do with the scope of the
% technique (the range of conditions over which the GLM will generate a
% good fit). GLMs have a much broader scope. Conversely, using a
% nonlinear regression to fit the inverse of the link function can
% (badly) misfire on occasion.
% In short, glmfit is more robust.
GLM_std = std(GLM_Prediction, [], 2);
OLS_std = std(OLS_Prediction, [], 2);
figure
plot(X, OLS_std./GLM_std)
xlabel('Time')
ylabel('Ratio of OLS std:GLM std')
The root of the difficulty you are having lies in the sentence:
Then using the EM algorithm, we can maximize the second
log-likelihood.
As you have observed, you can't. Instead, what you maximize is the expected value of the second log likelihood (known as the "complete-data log likelihood"), where the expected value is taken over the $z_i$.
This leads to an iterative procedure: at the $k^{th}$ iteration you calculate the expected values of the $z_i$ given the parameter estimates from the $(k-1)^{th}$ iteration (the "E-step"), then substitute them into the complete-data log likelihood (see the EDIT below for why we can do this in this case) and maximize it with respect to the parameters to get the estimates for the current iteration (the "M-step").
The complete-data log likelihood for the zero-inflated Poisson in the simplest case - two parameters, say $\lambda$ and $p$ - allows for substantial simplification when it comes to the M-step, and this carries over to some extent to your form. I'll show you how that works in the simple case via some R code, so you can see the essence of it. I won't simplify as far as possible, since that might cost clarity when you map it onto your problem:
# Generate data
# Lambda = 1, p(zero) = 0.1
x <- rpois(10000,1)
x[1:1000] <- 0
# Sufficient statistic for the ZIP
sum.x <- sum(x)
# (Poor) starting values for parameter estimates
phat <- 0.5
lhat <- 2.0
zhat <- rep(0,length(x))
for (i in 1:100) {
# zhat[x>0] <- 0 always, so no need to make the assignment at every iteration
zhat[x==0] <- phat/(phat + (1-phat)*exp(-lhat))
lhat <- sum.x/sum(1-zhat) # in effect, removing E(# zeroes due to z=1)
phat <- mean(zhat)
cat("Iteration: ",i, " lhat: ",lhat, " phat: ", phat,"\n")
}
Iteration: 1 lhat: 1.443948 phat: 0.3792712
Iteration: 2 lhat: 1.300164 phat: 0.3106252
Iteration: 3 lhat: 1.225007 phat: 0.268331
...
Iteration: 99 lhat: 0.9883329 phat: 0.09311933
Iteration: 100 lhat: 0.9883194 phat: 0.09310694
In your case, at each step you'll do a weighted Poisson regression where the weights are 1-zhat
to get the estimates of $\beta$ and therefore $\lambda_i$, and then maximize:
$\sum (\mathbb{E}z_i\log{p_i} + (1-\mathbb{E}z_i)\log{(1-p_i)})$
with respect to the coefficient vector of your matrix $\mathbf{G}$ to get the estimates of $p_i$. The expected values $\mathbb{E}z_i = p_i/(p_i+(1-p_i)\exp{(-\lambda_i)})$, again calculated at each iteration.
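Here's a minimal sketch of that covariate version, again in R. Everything in it (the design matrices X and G, the true parameter values, the fixed iteration count) is made up for illustration; both M-steps are carried out with calls to glm:
# Minimal sketch of the covariate version (all data and parameter
# values here are made up for illustration)
set.seed(1)
n <- 5000
X <- cbind(1, rnorm(n))   # design matrix for log(lambda)
G <- cbind(1, rnorm(n))   # design matrix for logit(p)
beta.true  <- c(0.5, 0.3)
gamma.true <- c(-1.5, 1.0)
lambda <- exp(drop(X %*% beta.true))
p      <- plogis(drop(G %*% gamma.true))
y <- ifelse(runif(n) < p, 0, rpois(n, lambda))
bhat <- c(0, 0); ghat <- c(0, 0)
for (k in 1:50) {         # 50 iterations is arbitrary; check convergence in practice
  lam <- exp(drop(X %*% bhat))
  pr  <- plogis(drop(G %*% ghat))
  # E-step: E(z_i) is zero wherever y_i > 0
  zhat <- ifelse(y == 0, pr / (pr + (1 - pr) * exp(-lam)), 0)
  # M-step for beta: weighted Poisson regression with weights 1 - zhat
  bhat <- coef(glm(y ~ X - 1, family = poisson, weights = 1 - zhat))
  # M-step for gamma: a binomial-type fit with zhat as a fractional
  # response maximizes sum(zhat*log(p) + (1-zhat)*log(1-p));
  # quasibinomial gives the same coefficients as binomial without the
  # non-integer-response warning
  ghat <- coef(glm(zhat ~ G - 1, family = quasibinomial))
}
rbind(estimate = c(bhat, ghat), truth = c(beta.true, gamma.true))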
If you want to do this for real data, as opposed to just understanding the algorithm, R packages already exist; here's an example (http://www.ats.ucla.edu/stat/r/dae/zipoisson.htm) using the pscl package.
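For instance, fitting the intercept-only model to the simulated x above takes one line (assuming pscl is installed; in zeroinfl's formula, the part before | is the count model and the part after is the zero-inflation model):
library(pscl)
# Intercept-only zero-inflated Poisson fit to the simulated x above
fit <- zeroinfl(x ~ 1 | 1, dist = "poisson")
exp(coef(fit, "count"))    # estimate of lambda; compare with lhat
plogis(coef(fit, "zero"))  # estimate of p; compare with phat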
EDIT: I should emphasize that what we are doing is maximizing the expected value of the complete-data log likelihood, NOT maximizing the complete-data log likelihood with the expected values of the missing data/latent variables plugged in. As it happens, if the complete-data log likelihood is linear in the missing data, as it is here, the two approaches are the same, but otherwise, they aren't.
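To see the linearity in this case: up to terms not involving $z_i$, the complete-data log likelihood contribution of a single observation is
$$\ell_i = z_i \log p_i + (1-z_i)\left(\log(1-p_i) - \lambda_i + y_i \log \lambda_i\right),$$
which is linear in $z_i$, so taking the expectation just replaces $z_i$ by $\mathbb{E}z_i$.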
Best Answer
Well, the $\log( y_{\bf t}!)$ terms don't involve ${\boldsymbol \theta}$, so forget about them. Multivariate derivatives are just concatenations of univariate partial derivatives. By linearity, the elements of the gradient vector are
$$ \frac{ \partial \ell( {\boldsymbol \theta} )}{ \partial \theta_{i}} = \sum_{ {\bf t} \in \mathcal{T} } \frac{ -\partial \lambda_{{\bf t}}({\boldsymbol \theta})}{ \partial \theta_{i}} + y_{{\bf t}} \cdot \frac{ \partial \log (\lambda_{{\bf t}}({\boldsymbol \theta})) }{ \partial \theta_{i}} $$
Given your expression for $\lambda_{{\bf t}}({\boldsymbol \theta})$,
$$\frac{ \partial \lambda_{{\bf t}}({\boldsymbol \theta})}{ \partial \theta_{i}} = f_{i}( {\bf t}), $$
since you're just differentiating a linear function of $\theta_{i}$. From basic single-variable calculus we know that
$$ \frac{ \partial \log(f(x)) }{\partial x} = \frac{1}{f(x)} \cdot \frac{ \partial f(x) }{\partial x}$$
So,
$$ \frac{\partial \log (\lambda_{{\bf t}}({\boldsymbol \theta})) }{ \partial \theta_{i}} = \frac{ f_{i}( {\bf t}) }{\sum_{k=1}^{d}\theta_k f_k\left(\mathbf{t}\right)} $$
Plug these parts back into the first equation above to get the score function.
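Putting the pieces back together, the $i^{th}$ component of the score is
$$ \frac{\partial \ell({\boldsymbol \theta})}{\partial \theta_{i}} = \sum_{{\bf t} \in \mathcal{T}} f_{i}({\bf t}) \left( \frac{y_{{\bf t}}}{\sum_{k=1}^{d} \theta_{k} f_{k}({\bf t})} - 1 \right) $$
As a sanity check, here's a small R snippet comparing this analytic score to a central finite difference; the basis functions ($f_1(t) = 1$, $f_2(t) = t$), the grid, and the parameter values are all made up for illustration:
set.seed(2)
t  <- 1:20
Fm <- cbind(1, t)                 # Fm[i, j] = f_j(t_i)
theta.true <- c(0.5, 0.2)
y <- rpois(length(t), drop(Fm %*% theta.true))
# log likelihood with the log(y!) terms dropped, as in the answer
loglik <- function(theta) sum(-Fm %*% theta + y * log(Fm %*% theta))
# analytic score: sum_t f_i(t) * (y_t / lambda_t - 1)
score  <- function(theta) colSums(Fm * (y / drop(Fm %*% theta) - 1))
theta <- c(0.4, 0.25)
eps <- 1e-6
numeric.grad <- sapply(1:2, function(i) {
  e <- replace(numeric(2), i, eps)
  (loglik(theta + e) - loglik(theta - e)) / (2 * eps)
})
rbind(analytic = score(theta), numeric = numeric.grad)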