Not quite what you're asking for, but the Statistics Toolbox has a handy little function called glmfit
I'm attaching some code that I've used to demo Poisson regression in the past...
%% Generalized Linear Model Demo
% Generalized Linear Models encompass techniques like Poisson
% regression and logistic regression. These techniques are used to
% model counts and to estimate odds.
% Formally, the technique was developed to model data where the error terms
% don't exhibit constant variance.
% Specify the average relationship between X and Y
clear all
clc
% Create a set of 100 X values between 0 and 15
X = linspace(0,15,100);
X = X';
% Define a function that describes the average relationship between X and
% Y.
mu = @(x) exp(-1.5 +.25*x);
Y = mu(X);
plot(X,Y);
xlabel('Time');
ylabel('Count');
%% Use the Y = f(X) to generate a dataset with known characteristics
% At each "X", the observed Y is represented as a draw from a Poisson
% distribution with mean (lambda) = Y
Noisy_Y = poissrnd(Y,100,1);
% Create figure
figure1 = figure;
% Create axes
axes1 = axes('Parent',figure1,'YTick',[0:max(Noisy_Y)], 'YGrid','on');
hold(axes1,'all');
% Create scatter
scatter(X,Noisy_Y, 'b');
xlabel('Time')
ylabel('Count')
% Note the following:
% The mean of a Poisson process is also the variance of a Poisson process
% (a quick empirical check of this appears at the end of this cell).
% The blue curve doesn't exhibit constant variance (this violates one of
% the key assumptions underlying Ordinary Least Squares).
% The output from a Poisson process is always a non-negative integer. If
% lambda is close to zero and observations can never go negative, the
% shape of the distribution is bounded in one direction.
% As lambda gets larger, this boundary has less and less impact on the
% distribution of the errors.
hold on
plot(X,Y);
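% As promised above, a quick empirical check of the mean = variance
% property of the Poisson distribution (this check is an added
% illustration, not part of the fitting steps below):
check_sample = poissrnd(5, 10000, 1);
fprintf('mean = %.3f, variance = %.3f\n', mean(check_sample), var(check_sample));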
%% Create a fit using nonlinear regression
% The model used for the nonlinear regression is the same as "mu"
foo = fit(X, Noisy_Y, 'exp(B1 + B2*x)');
plot(foo)
xlabel('Time')
ylabel('Count')
SSE_OLS = sum((foo(X) - Noisy_Y).^2)
% Create textbox
annotation(figure1,'textbox',...
[0.147443514644351 0.802739726027397 0.255903765690377 0.0931506849315069],...
'String',{['SSE OLS = ' num2str(SSE_OLS)]},...
'FitBoxToText','off',...
'BackgroundColor',[1 1 1]);
%% Create a fit using a Generalized Linear Model
[b, dev, stats] = glmfit(X, Noisy_Y, 'poisson');
Predicted_Y = exp(b(1) + b(2)*X);
plot(X, Predicted_Y, 'k');
SSE_GLM = sum(stats.resid.^2)
% Create textbox
annotation(figure1,'textbox',...
[0.147443514644351 0.802739726027397 0.255903765690377 0.0931506849315069],...
'String',{['SSE OLS = ' num2str(SSE_OLS)], ['SSE GLM = ' num2str(SSE_GLM)]},...
'FitBoxToText','off',...
'BackgroundColor',[1 1 1]);
% The two curves (should) look almost the same
% The SSE is (essentially) the same.
% What gives? Why are we bothering with this GLM stuff?
%% Perform a Monte Carlo simulation
% Generate 1000 data sets that are consistent with the assumed
% relationship between Count and Time
% Preallocate the sliced output variables used inside the parfor loop
GLM_Prediction = zeros(length(X), 1000);
OLS_Prediction = zeros(length(X), 1000);
parfor i = 1:1000
Noisy_Y = poissrnd(Y, 100,1);
b = glmfit(X,Noisy_Y, 'poisson');
GLM_Prediction(:,i) = exp(b(1) + b(2)*X);
% Provide optimal starting values (to make sure that any differences
% are related to the algorithms rather than to convergence)
foo = fit(X, Noisy_Y, 'exp(B1 + B2*x)', 'StartPoint', [-1.5, .25]);
B = coeffvalues(foo);
OLS_Prediction(:,i) = exp(B(1) + B(2)*X);
end
figure
True = plot(X, Y, 'b');
hold on
GLM_Mean = plot(X, mean(GLM_Prediction, 2), 'k');
OLS_Mean = plot(X, mean(OLS_Prediction, 2), 'r');
xlabel('Time')
ylabel('Count')
% The means of the two techniques are, for all intents and purposes,
% identical.
%%
% The standard deviation from the GLM is MUCH smaller, even though
% I provided the nonlinear regression with optimal starting values.
% Formally, the advantage of using a GLM has to do with the scope of the
% technique (the range of conditions over which the GLM will generate a
% good fit). GLMs have a much broader scope. Conversely, using a
% nonlinear regression to fit the inverse of the link function can
% (badly) misfire on occasion.
% In short, glmfit is more robust.
GLM_std = std(GLM_Prediction, [], 2);
OLS_std = std(OLS_Prediction, [], 2);
figure
plot(X, OLS_std./GLM_std)
xlabel('Time')
ylabel('Ratio of OLS std:GLM std')
The root of the difficulty you are having lies in the sentence:
Then using the EM algorithm, we can maximize the second
log-likelihood.
As you have observed, you can't. Instead, what you maximize is the expected value of the second log likelihood (known as the "complete-data log likelihood"), where the expected value is taken over the $z_i$.
This leads to an iterative procedure: at the $k^{th}$ iteration you calculate the expected values of the $z_i$ given the parameter estimates from the $(k-1)^{th}$ iteration (the "E-step"), then substitute them into the complete-data log likelihood (see the EDIT below for why we can do this in this case) and maximize it with respect to the parameters to get the estimates for the current iteration (the "M-step").
The complete-data log likelihood for the zero-inflated Poisson in the simplest case - two parameters, say $\lambda$ and $p$ - allows for substantial simplification when it comes to the M-step, and this carries over to some extent to your form. I'll show you how that works in the simple case via some R code, so you can see the essence of it. I won't simplify as far as possible, since that might cost clarity when you map it onto your problem:
# Generate data
# Lambda = 1, p(zero) = 0.1
x <- rpois(10000,1)
x[1:1000] <- 0
# Sufficient statistic for the ZIP
sum.x <- sum(x)
# (Poor) starting values for parameter estimates
phat <- 0.5
lhat <- 2.0
zhat <- rep(0,length(x))
for (i in 1:100) {
# zhat[x>0] <- 0 always, so no need to make the assignment at every iteration
zhat[x==0] <- phat/(phat + (1-phat)*exp(-lhat))
lhat <- sum.x/sum(1-zhat) # in effect, removing E(# zeroes due to z=1)
phat <- mean(zhat)
cat("Iteration: ",i, " lhat: ",lhat, " phat: ", phat,"\n")
}
Iteration: 1 lhat: 1.443948 phat: 0.3792712
Iteration: 2 lhat: 1.300164 phat: 0.3106252
Iteration: 3 lhat: 1.225007 phat: 0.268331
...
Iteration: 99 lhat: 0.9883329 phat: 0.09311933
Iteration: 100 lhat: 0.9883194 phat: 0.09310694
In your case, at each step you'll do a weighted Poisson regression where the weights are 1-zhat
to get the estimates of $\beta$ and therefore $\lambda_i$, and then maximize:
$\sum (\mathbb{E}z_i\log{p_i} + (1-\mathbb{E}z_i)\log{(1-p_i)})$
with respect to the coefficient vector of your matrix $\mathbf{G}$ to get the estimates of $p_i$. The expected values $\mathbb{E}z_i = p_i/(p_i+(1-p_i)\exp{(-\lambda_i)})$, again calculated at each iteration.
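Here's a minimal sketch of that covariate version, again in R. Everything in it (the design matrices X and G, the true parameter values, the fixed iteration count) is made up for illustration; both M-steps are carried out with calls to glm:
# Minimal sketch of the covariate version (all data and parameter
# values here are made up for illustration)
set.seed(1)
n <- 5000
X <- cbind(1, rnorm(n))   # design matrix for log(lambda)
G <- cbind(1, rnorm(n))   # design matrix for logit(p)
beta.true  <- c(0.5, 0.3)
gamma.true <- c(-1.5, 1.0)
lambda <- exp(drop(X %*% beta.true))
p      <- plogis(drop(G %*% gamma.true))
y <- ifelse(runif(n) < p, 0, rpois(n, lambda))
bhat <- c(0, 0); ghat <- c(0, 0)
for (k in 1:50) {         # 50 iterations is arbitrary; check convergence in practice
  lam <- exp(drop(X %*% bhat))
  pr  <- plogis(drop(G %*% ghat))
  # E-step: E(z_i) is zero wherever y_i > 0
  zhat <- ifelse(y == 0, pr / (pr + (1 - pr) * exp(-lam)), 0)
  # M-step for beta: weighted Poisson regression with weights 1 - zhat
  bhat <- coef(glm(y ~ X - 1, family = poisson, weights = 1 - zhat))
  # M-step for gamma: a binomial-type fit with zhat as a fractional
  # response maximizes sum(zhat*log(p) + (1-zhat)*log(1-p));
  # quasibinomial gives the same coefficients as binomial without the
  # non-integer-response warning
  ghat <- coef(glm(zhat ~ G - 1, family = quasibinomial))
}
rbind(estimate = c(bhat, ghat), truth = c(beta.true, gamma.true))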
If you want to do this for real data, as opposed to just understanding the algorithm, R packages already exist; here's an example (http://www.ats.ucla.edu/stat/r/dae/zipoisson.htm) using the pscl package.
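For instance, fitting the intercept-only model to the simulated x above takes one line (assuming pscl is installed; in zeroinfl's formula, the part before | is the count model and the part after is the zero-inflation model):
library(pscl)
# Intercept-only zero-inflated Poisson fit to the simulated x above
fit <- zeroinfl(x ~ 1 | 1, dist = "poisson")
exp(coef(fit, "count"))    # estimate of lambda; compare with lhat
plogis(coef(fit, "zero"))  # estimate of p; compare with phat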
EDIT: I should emphasize that what we are doing is maximizing the expected value of the complete-data log likelihood, NOT maximizing the complete-data log likelihood with the expected values of the missing data/latent variables plugged in. As it happens, if the complete-data log likelihood is linear in the missing data, as it is here, the two approaches are the same, but otherwise, they aren't.
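To see the linearity in this case: up to terms not involving $z_i$, the complete-data log likelihood contribution of a single observation is
$$\ell_i = z_i \log p_i + (1-z_i)\left(\log(1-p_i) - \lambda_i + y_i \log \lambda_i\right),$$
which is linear in $z_i$, so taking the expectation just replaces $z_i$ by $\mathbb{E}z_i$.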
Best Answer
Well, the $\log( y_{\bf t}!)$ terms don't involve ${\boldsymbol \theta}$, so forget about them. Multivariate derivatives are just concatenations of univariate partial derivatives. By linearity, the elements of the gradient vector are
$$ \frac{ \partial \ell( {\boldsymbol \theta} )}{ \partial \theta_{i}} = \sum_{ {\bf t} \in \mathcal{T} } \frac{ -\partial \lambda_{{\bf t}}({\boldsymbol \theta})}{ \partial \theta_{i}} + y_{{\bf t}} \cdot \frac{ \partial \log (\lambda_{{\bf t}}({\boldsymbol \theta})) }{ \partial \theta_{i}} $$
Given your expression for $\lambda_{{\bf t}}({\boldsymbol \theta})$,
$$\frac{ \partial \lambda_{{\bf t}}({\boldsymbol \theta})}{ \partial \theta_{i}} = f_{i}( {\bf t}), $$
since you're just differentiating a linear function of $\theta_{i}$. From basic single-variable calculus we know that
$$ \frac{ \partial \log(f(x)) }{\partial x} = \frac{1}{f(x)} \cdot \frac{ \partial f(x) }{\partial x}$$
So,
$$ \frac{\partial \log (\lambda_{{\bf t}}({\boldsymbol \theta})) }{ \partial \theta_{i}} = \frac{ f_{i}( {\bf t}) }{\sum_{k=1}^{d}\theta_k f_k\left(\mathbf{t}\right)} $$
Plug these parts back into the first equation above to get the score function.
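Putting the pieces back together, the $i^{th}$ component of the score is
$$ \frac{\partial \ell({\boldsymbol \theta})}{\partial \theta_{i}} = \sum_{{\bf t} \in \mathcal{T}} f_{i}({\bf t}) \left( \frac{y_{{\bf t}}}{\sum_{k=1}^{d} \theta_{k} f_{k}({\bf t})} - 1 \right) $$
As a sanity check, here's a small R snippet comparing this analytic score to a central finite difference; the basis functions ($f_1(t) = 1$, $f_2(t) = t$), the grid, and the parameter values are all made up for illustration:
set.seed(2)
t  <- 1:20
Fm <- cbind(1, t)                 # Fm[i, j] = f_j(t_i)
theta.true <- c(0.5, 0.2)
y <- rpois(length(t), drop(Fm %*% theta.true))
# log likelihood with the log(y!) terms dropped, as in the answer
loglik <- function(theta) sum(-Fm %*% theta + y * log(Fm %*% theta))
# analytic score: sum_t f_i(t) * (y_t / lambda_t - 1)
score  <- function(theta) colSums(Fm * (y / drop(Fm %*% theta) - 1))
theta <- c(0.4, 0.25)
eps <- 1e-6
numeric.grad <- sapply(1:2, function(i) {
  e <- replace(numeric(2), i, eps)
  (loglik(theta + e) - loglik(theta - e)) / (2 * eps)
})
rbind(analytic = score(theta), numeric = numeric.grad)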