Technically, LDA Gibbs sampling works because we intentionally set up a Markov chain that converges to the posterior distribution of the model parameters (here, the word–topic assignments). See http://en.wikipedia.org/wiki/Gibbs_sampling#Mathematical_background.
But I guess you are seeking an intuitive answer as to why the sampler tends to put similar words into the same topic? That's an interesting question. If you look at the equations for collapsed Gibbs sampling, there is one factor for words and another for documents. Probabilities are higher for assignments that "don't break document boundaries": words appearing in the same document have slightly higher odds of ending up in the same topic. The same holds for document assignments, which to a degree follow "word boundaries". Eventually these effects mix and spread over clusters of documents and words.
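Concretely, the collapsed Gibbs update for the topic assignment $z_i$ of token $i$ (word type $w_i$, in document $d_i$) has the familiar two-factor form of Griffiths & Steyvers (2004):

$$P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n^{(w_i)}_{k,-i} + \beta}{n^{(\cdot)}_{k,-i} + V\beta} \;\times\; \left(n^{(d_i)}_{k,-i} + \alpha\right),$$

where $n^{(w_i)}_{k,-i}$ is the number of other tokens of word type $w_i$ currently assigned to topic $k$, $n^{(\cdot)}_{k,-i}$ the total number of other tokens assigned to topic $k$, $n^{(d_i)}_{k,-i}$ the number of other tokens in document $d_i$ assigned to topic $k$, and $V$ the vocabulary size. The first factor favors topics that already contain the same word elsewhere in the corpus; the second favors topics already used within the same document.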
By the way, LDA Gibbs samplers do not actually work properly, in the sense that they do not mix, i.e. they are not able to represent the full posterior distribution. If they did, the permutation symmetry of the model over topic labels would make all solutions obtained by the sampler useless, or at least non-interpretable. Instead, the sampler stays near a local mode (of the likelihood), and we get well-defined topics.
The root of the difficulty you are having lies in the sentence:
Then using the EM algorithm, we can maximize the second
log-likelihood.
As you have observed, you can't. Instead, what you maximize is the expected value of the second log likelihood (known as the "complete-data log likelihood"), where the expectation is taken over the $z_i$.
This leads to an iterative procedure: at the $k^{th}$ iteration you calculate the expected values of the $z_i$ given the parameter estimates from the $(k-1)^{th}$ iteration (the "E-step"), then substitute them into the complete-data log likelihood (see EDIT below for why we can do this in this case) and maximize it with respect to the parameters to get the estimates for the current iteration (the "M-step").
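To make this concrete for the two-parameter case treated below, write $z_i = 1$ when observation $i$ is a structural zero. The complete-data log likelihood is

$$\ell_c(p,\lambda) = \sum_i \Big[ z_i \log p + (1-z_i)\big(\log(1-p) - \lambda + x_i \log\lambda - \log x_i!\big) \Big],$$

and the E-step gives

$$\mathbb{E}z_i = \frac{p}{p + (1-p)e^{-\lambda}} \;\text{ if } x_i = 0, \qquad \mathbb{E}z_i = 0 \;\text{ if } x_i > 0,$$

evaluated at the current parameter estimates.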
The complete-data log likelihood for the zero-inflated Poisson in the simplest case - two parameters, say $\lambda$ and $p$ - allows for substantial simplification when it comes to the M-step, and this carries over to some extent to your form. I'll show you how that works in the simple case via some R code, so you can see the essence of it. I won't simplify as much as possible, since that might cause a loss of clarity when you think of your problem:
# Generate data
# Lambda = 1, p(zero) = 0.1
x <- rpois(10000,1)
x[1:1000] <- 0
# Sufficient statistic for the ZIP
sum.x <- sum(x)
# (Poor) starting values for parameter estimates
phat <- 0.5
lhat <- 2.0
zhat <- rep(0,length(x))
for (i in 1:100) {
  # E-step: update E(z); zhat[x>0] is always 0 (a positive count cannot be a
  # structural zero), so only the x==0 entries need updating each iteration
  zhat[x==0] <- phat/(phat + (1-phat)*exp(-lhat))
  # M-step: closed-form updates for lambda and p
  lhat <- sum.x/sum(1-zhat) # in effect, removing E(# zeroes due to z=1)
  phat <- mean(zhat)
  cat("Iteration: ",i, " lhat: ",lhat, " phat: ", phat,"\n")
}
Iteration: 1 lhat: 1.443948 phat: 0.3792712
Iteration: 2 lhat: 1.300164 phat: 0.3106252
Iteration: 3 lhat: 1.225007 phat: 0.268331
...
Iteration: 99 lhat: 0.9883329 phat: 0.09311933
Iteration: 100 lhat: 0.9883194 phat: 0.09310694
In your case, at each step you'll do a weighted Poisson regression where the weights are 1-zhat
to get the estimates of $\beta$ and therefore $\lambda_i$, and then maximize:
$\sum (\mathbb{E}z_i\log{p_i} + (1-\mathbb{E}z_i)\log{(1-p_i)})$
with respect to the coefficient vector of your matrix $\mathbf{G}$ to get the estimates of $p_i$. The expected values are $\mathbb{E}z_i = p_i/(p_i+(1-p_i)\exp{(-\lambda_i)})$ for observations with $x_i = 0$ (and $0$ otherwise), again calculated at each iteration.
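If it helps to see those two M-steps in code, here is a minimal sketch on simulated data; X and G are hypothetical design matrices standing in for your covariates, and the true coefficients are made up for illustration. The Poisson part is the weighted regression described above, and the binomial fit of zhat on G maximizes exactly the weighted sum written above (R warns about the fractional response, which is harmless here):
# Sketch: EM for a zero-inflated Poisson regression (hypothetical X and G)
set.seed(1)
n <- 5000
X <- cbind(1, rnorm(n))                 # hypothetical covariates for lambda_i
G <- cbind(1, rnorm(n))                 # hypothetical covariates for p_i
lambda <- exp(drop(X %*% c(0.2, 0.5)))  # made-up true Poisson coefficients
p      <- plogis(drop(G %*% c(-2, 1)))  # made-up true zero-inflation coefficients
z <- rbinom(n, 1, p)                    # latent structural-zero indicators
x <- ifelse(z == 1, 0, rpois(n, lambda))

lambda.hat <- rep(mean(x), n)
p.hat      <- rep(0.5, n)
for (it in 1:100) {
  # E-step: E(z_i) is nonzero only where x_i = 0
  zhat <- ifelse(x == 0, p.hat/(p.hat + (1 - p.hat)*exp(-lambda.hat)), 0)
  # M-step, Poisson part: weighted Poisson regression with weights 1 - zhat
  pois.fit   <- glm(x ~ X - 1, family = poisson, weights = 1 - zhat)
  lambda.hat <- exp(drop(X %*% coef(pois.fit)))
  # M-step, zero-inflation part: binomial regression of E(z_i) on G
  # (fractional response -> harmless "non-integer #successes" warning)
  bin.fit <- suppressWarnings(glm(zhat ~ G - 1, family = binomial))
  p.hat   <- plogis(drop(G %*% coef(bin.fit)))
}
coef(pois.fit)  # compare with the true values c(0.2, 0.5)
coef(bin.fit)   # compare with the true values c(-2, 1)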
If you want to do this for real data, as opposed to just understanding the algorithm, R packages already exist; here's an example http://www.ats.ucla.edu/stat/r/dae/zipoisson.htm using the pscl library.
EDIT: I should emphasize that what we are doing is maximizing the expected value of the complete-data log likelihood, NOT maximizing the complete-data log likelihood with the expected values of the missing data/latent variables plugged in. As it happens, if the complete-data log likelihood is linear in the missing data, as it is here, the two approaches are the same, but otherwise, they aren't.
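In symbols: if $\ell_c(\theta; x, z)$ is linear in $z$, as it is in the $z_i$ for the zero-inflated Poisson, then

$$\mathbb{E}_{z}\big[\ell_c(\theta; x, z)\big] = \ell_c\big(\theta; x, \mathbb{E}z\big),$$

so plugging the expected values into the complete-data log likelihood happens to give the right objective; for a complete-data log likelihood that is nonlinear in the latent variables, this shortcut would not be valid.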
The online VB algorithm for LDA implemented in gensim and scikit-learn, in addition to computing topic distributions for new unseen documents, implicitly computes $q(z_{dw} = k) = \phi_{dwk}$, the probability of assigning word $w$ in document $d$ to topic $k$. It is only computed implicitly, to save storage (in online LDA, $d$ ranges over a mini-batch of documents rather than the entire corpus). $\phi_{dwk}$ can be seen as a proxy for what the collapsed Gibbs sampler does when it assigns topic labels to individual words in every document.
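Concretely, in the mean-field variational formulation that online LDA is based on (Hoffman et al., 2010), this per-word responsibility is updated as

$$\phi_{dwk} \;\propto\; \exp\Big\{ \Psi(\gamma_{dk}) - \Psi\big(\textstyle\sum_{k'} \gamma_{dk'}\big) + \Psi(\lambda_{kw}) - \Psi\big(\textstyle\sum_{w'} \lambda_{kw'}\big) \Big\},$$

where $\gamma_{dk}$ and $\lambda_{kw}$ are the variational Dirichlet parameters for the document–topic and topic–word distributions and $\Psi$ is the digamma function; normalizing over $k$ gives $q(z_{dw}=k)$, the soft analogue of the hard topic labels the collapsed Gibbs sampler assigns.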