The zero-inflated Poisson regression model is defined for a sample $(y_1,\ldots,y_n)$ by
$$
Y_i = \begin{cases} 0 & \text{with probability } p_i+(1-p_i)e^{-\lambda_i}\\
k & \text{with probability } (1-p_i)e^{-\lambda_i} \lambda_{i}^{k}/k!, \quad k = 1, 2, \ldots
\end{cases}$$
and it further assumes that the parameters $\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_n)$ and $\mathbf{p} = (p_1, \dots, p_n)$ satisfy
$$\eqalign{
\log(\boldsymbol{\lambda}) &= \mathbf{B} \boldsymbol{\beta} \\
\text{logit}(\mathbf{p}) &= \log(\mathbf{p}/(1-\mathbf{p})) = \mathbf{G} \boldsymbol{\gamma}.
}$$
The corresponding log-likelihood of the zero-inflated Poisson regression model is
$$\eqalign{
L(\boldsymbol{\gamma},\boldsymbol{\beta}; \mathbf{y}) &= \sum_{y_i=0} \log(e^{\mathbf{G}_i \boldsymbol{\gamma}}+\exp(-e^{\mathbf{B}_i \boldsymbol{\beta}})) +\sum_{y_i >0} (y_i \mathbf{B}_i \boldsymbol{\beta}-e^{\mathbf{B}_i \boldsymbol{\beta}})\\
&\quad -\sum_{i=1}^{n} \log(1+e^{\mathbf{G}_{i} \boldsymbol{\gamma}})-\sum_{y_i >0} \log(y_{i}!).}$$
Here, $\mathbf{B}$ and $\mathbf{G}$ are the design matrices. They may be identical, depending on the features one wishes to use for the two generating processes, but they must have the same number of rows.
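To make the two-part generating process concrete, here is a small simulation sketch (Python/NumPy; the design matrices, coefficients, and sample size are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical design matrices: an intercept plus one shared covariate.
x = rng.normal(size=n)
B = np.column_stack([np.ones(n), x])  # design for the Poisson means
G = np.column_stack([np.ones(n), x])  # design for the zero state

beta = np.array([0.5, 0.8])    # made-up coefficients
gamma = np.array([-1.0, 0.5])

lam = np.exp(B @ beta)                # log(lambda) = B beta
p = 1 / (1 + np.exp(-(G @ gamma)))    # logit(p) = G gamma

z = rng.random(n) < p                 # 1 = perfect zero state
y = np.where(z, 0, rng.poisson(lam))  # Poisson draw otherwise
```

Every observation with $z_i = 1$ is an exact zero, while the Poisson state also produces zeros with probability $e^{-\lambda_i}$, which is why the two sources mix in the marginal pmf above.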
Assuming that we could observe $Z_i = 1$ when $Y_i$ is from the perfect zero state and $Z_i = 0$ when $Y_i$ is from the Poisson state, the log-likelihood would be
$$L(\boldsymbol{\gamma},\boldsymbol{\beta}; \mathbf{y}, \mathbf{z}) = \sum_{i=1}^{n} \log(f(z_i|\boldsymbol{\gamma}))+\sum_{i=1}^{n} \log(f(y_i|z_i, \boldsymbol{\beta}))$$
$$ = \sum_{i=1}^{n} z_{i} (\mathbf{G}_i \boldsymbol{\gamma}-\log(1+e^{\mathbf{G}_{i} \boldsymbol{\gamma}}))-\sum_{i=1}^{n} (1-z_{i})\log(1+e^{\mathbf{G}_{i} \boldsymbol{\gamma}})+ \sum_{i=1}^{n} (1-z_i)[y_{i} \mathbf{B}_i \boldsymbol{\beta}-e^{\mathbf{B}_i \boldsymbol{\beta}} - \log(y_{i}!)]$$
The first two terms form the log-likelihood of a logistic regression separating $z_i=1$ from $z_i=0$; the third term is the log-likelihood of a Poisson regression restricted to the points generated by the Poisson process.
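As a quick sanity check on that algebra, the complete-data expression can be compared numerically with the direct sum of $\log f(z_i|\boldsymbol{\gamma})$ and $\log f(y_i|z_i,\boldsymbol{\beta})$ (a Python sketch; the data and parameter values are made up):

```python
import numpy as np

def log_factorial(y):
    """log(y!) for a vector of non-negative integers, via a cumulative table."""
    y = np.asarray(y)
    table = np.concatenate([[0.0], np.cumsum(np.log(np.arange(1, y.max() + 1)))])
    return table[y]

def complete_loglik(y, z, B, G, beta, gamma):
    eta_g = G @ gamma  # logit(p_i)
    eta_b = B @ beta   # log(lambda_i)
    return (np.sum(z * (eta_g - np.log1p(np.exp(eta_g))))
            - np.sum((1 - z) * np.log1p(np.exp(eta_g)))
            + np.sum((1 - z) * (y * eta_b - np.exp(eta_b) - log_factorial(y))))

# Tiny made-up example.
y = np.array([0, 3, 1, 0])
z = np.array([1, 0, 0, 0])
B = G = np.column_stack([np.ones(4), [0.1, -0.2, 0.3, 0.0]])
beta, gamma = np.array([0.2, 0.5]), np.array([-0.5, 0.3])
ll = complete_loglik(y, z, B, G, beta, gamma)
```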
But aren't latent variables unobservable? The goal is to maximize the first log-likelihood, yet we introduce latent variables, derive a second log-likelihood, and maximize that one with the EM algorithm instead. Doesn't this assume that we know whether $Z_i = 0$ or $Z_i = 1$ for each observation?
Best Answer
The root of the difficulty you are having lies in your closing question: how can we assume the $z_i$ are known? As you have observed, you can't. Instead, what you maximize is the expected value of the second log-likelihood (known as the "complete-data log-likelihood"), where the expectation is taken over the $z_i$.
This leads to an iterative procedure: at the $k^{\text{th}}$ iteration you calculate the expected values of the $z_i$ given the parameter estimates from the $(k-1)^{\text{th}}$ iteration (the "E-step"), substitute them into the complete-data log-likelihood (see the EDIT below for why we can do this in this case), and maximize that with respect to the parameters to get the estimates for the current iteration (the "M-step").
The complete-data log-likelihood for the zero-inflated Poisson in the simplest case (two parameters, say $\lambda$ and $p$) allows for substantial simplification when it comes to the M-step, and this carries over to some extent to your regression form. I'll show you how that works in the simple case via a short code sketch, so you can see the essence of it. I won't simplify as much as possible, since that might cause a loss of clarity when you think about your problem:
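A minimal sketch of that two-parameter EM (here in Python with simulated data rather than R; in this simplest case both M-step updates are available in closed form):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate from the two-parameter model: true p = 0.3, true lambda = 2.0.
n, p_true, lam_true = 50_000, 0.3, 2.0
z = rng.random(n) < p_true
y = np.where(z, 0, rng.poisson(lam_true, size=n))

# EM from rough starting values.
p_hat, lam_hat = 0.5, y.mean()
for _ in range(200):
    # E-step: E[z_i | y_i] is zero unless y_i == 0.
    ez = np.where(y == 0,
                  p_hat / (p_hat + (1 - p_hat) * np.exp(-lam_hat)),
                  0.0)
    # M-step: closed form here -- a weighted mean plays the role of the
    # weighted Poisson regression, with weights 1 - ez.
    p_hat = ez.mean()
    lam_hat = ((1 - ez) * y).sum() / (1 - ez).sum()
```

The two M-step lines are exactly the logistic and Poisson pieces of the complete-data log-likelihood, each maximized with the $z_i$ replaced by their expectations.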
In your case, at each iteration you'll fit a weighted Poisson regression with weights `1 - zhat` to get the estimates of $\boldsymbol{\beta}$, and therefore of the $\lambda_i$, and then maximize
$$\sum_{i=1}^{n} \left(\mathbb{E}z_i\log{p_i} + (1-\mathbb{E}z_i)\log{(1-p_i)}\right)$$
with respect to the coefficient vector of your matrix $\mathbf{G}$ to get the estimates of the $p_i$. The expected values are recalculated at each iteration as $\mathbb{E}z_i = p_i/(p_i+(1-p_i)\exp{(-\lambda_i)})$ for observations with $y_i = 0$; any observation with $y_i > 0$ must come from the Poisson state, so there $\mathbb{E}z_i = 0$.
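Putting the E-step and the two weighted fits together for the regression case (a Python sketch with simulated data; the Newton inner loops stand in for the weighted Poisson and logistic regressions a GLM routine would normally do, and all names are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40_000

# Made-up design matrices (intercept + one covariate), shared here.
x = rng.normal(size=n)
B = G = np.column_stack([np.ones(n), x])
beta_true, gamma_true = np.array([0.6, 0.4]), np.array([-0.8, 0.5])

lam = np.exp(B @ beta_true)
p = 1 / (1 + np.exp(-(G @ gamma_true)))
y = np.where(rng.random(n) < p, 0, rng.poisson(lam))

def newton_poisson(B, y, w, beta, steps=10):
    """Weighted Poisson regression: maximize sum w*(y*eta - exp(eta))."""
    for _ in range(steps):
        mu = np.exp(B @ beta)
        grad = B.T @ (w * (y - mu))
        hess = B.T @ (B * (w * mu)[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def newton_logistic(G, ez, gamma, steps=10):
    """Maximize sum(ez*eta - log(1 + exp(eta))) over gamma."""
    for _ in range(steps):
        q = 1 / (1 + np.exp(-(G @ gamma)))
        grad = G.T @ (ez - q)
        hess = G.T @ (G * (q * (1 - q))[:, None])
        gamma = gamma + np.linalg.solve(hess, grad)
    return gamma

beta_hat = np.array([np.log(y.mean()), 0.0])
gamma_hat = np.zeros(2)
for _ in range(100):
    # E-step: expected z_i, nonzero only where y_i == 0.
    lam_hat = np.exp(B @ beta_hat)
    p_hat = 1 / (1 + np.exp(-(G @ gamma_hat)))
    ez = np.where(y == 0, p_hat / (p_hat + (1 - p_hat) * np.exp(-lam_hat)), 0.0)
    # M-step: weighted Poisson fit, then logistic fit to the expected z.
    beta_hat = newton_poisson(B, y, 1 - ez, beta_hat)
    gamma_hat = newton_logistic(G, ez, gamma_hat)
```

In practice you would hand the two weighted fits to an existing GLM routine rather than coding Newton updates by hand; the sketch just makes the two pieces of the M-step explicit.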
If you want to do this for real data, as opposed to just understanding the algorithm, R packages already exist; here's an example http://www.ats.ucla.edu/stat/r/dae/zipoisson.htm using the `pscl` library.

EDIT: I should emphasize that what we are doing is maximizing the expected value of the complete-data log likelihood, NOT maximizing the complete-data log likelihood with the expected values of the missing data/latent variables plugged in. As it happens, if the complete-data log likelihood is linear in the missing data, as it is here, the two approaches are the same, but otherwise, they aren't.
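The linearity point is easy to verify numerically: for a single $y_i = 0$ term, the complete-data contribution $z\log p + (1-z)(\log(1-p) - \lambda)$ is linear in $z$, so plugging in $\mathbb{E}z$ reproduces the expectation exactly, while a nonlinear function of $z$ does not (Python, made-up parameter values):

```python
import numpy as np

# One observation with y = 0; made-up parameter values.
p, lam = 0.4, 1.5
ez = p / (p + (1 - p) * np.exp(-lam))  # E[z | y = 0]

def ell(z):
    """Complete-data log-likelihood contribution of a y = 0 term."""
    return z * np.log(p) + (1 - z) * (np.log(1 - p) - lam)

# Linear in z: expectation over z ~ Bernoulli(ez) equals the plug-in value.
expected = ez * ell(1.0) + (1 - ez) * ell(0.0)
plugged = ell(ez)

# A nonlinear function of z (its square) breaks the equality.
expected_sq = ez * ell(1.0) ** 2 + (1 - ez) * ell(0.0) ** 2
plugged_sq = ell(ez) ** 2
```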