The root of the difficulty you are having lies in the sentence:
"Then using the EM algorithm, we can maximize the second log-likelihood."
As you have observed, you can't. Instead, what you maximize is the expected value of the second log likelihood (known as the "complete data log likelihood"), where the expected value is taken over the $z_i$.
This leads to an iterative procedure. At the $k^{th}$ iteration, you calculate the expected values of the $z_i$ given the parameter estimates from the $(k-1)^{th}$ iteration (the "E-step"), substitute them into the complete-data log likelihood (see EDIT below for why we can do this in this case), and maximize that with respect to the parameters to get the estimates for the current iteration (the "M-step").
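For reference, here is the standard derivation in the two-parameter case, which is what the code below implements. Write $z_i=1$ when observation $i$ is a structural zero; the complete-data log likelihood is
$$
\ell_c(\lambda,p)=\sum_i\left\{z_i\log p+(1-z_i)\left[\log(1-p)-\lambda+x_i\log\lambda-\log x_i!\right]\right\}.
$$
Replacing each $z_i$ by $\mathbb{E}z_i$ and maximizing gives closed-form M-step updates:
$$
\hat p=\frac{1}{n}\sum_i\mathbb{E}z_i,\qquad\hat\lambda=\frac{\sum_i x_i}{\sum_i(1-\mathbb{E}z_i)},
$$
where the second uses $\mathbb{E}z_i\cdot x_i=0$ (since $\mathbb{E}z_i=0$ whenever $x_i>0$). These are exactly the two update lines inside the loop below.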
The complete-data log likelihood for the zero-inflated Poisson in the simplest case - two parameters, say $\lambda$ and $p$ - allows for substantial simplification when it comes to the M-step, and this carries over to some extent to your form. I'll show you how that works in the simple case via some R code, so you can see the essence of it. I won't simplify as much as possible, since that might cause a loss of clarity when you think of your problem:
# Generate data
# Lambda = 1, p(zero) = 0.1
x <- rpois(10000,1)
x[1:1000] <- 0
# Sufficient statistic for the ZIP
sum.x <- sum(x)
# (Poor) starting values for parameter estimates
phat <- 0.5
lhat <- 2.0
zhat <- rep(0,length(x))
for (i in 1:100) {
  # E-step: zhat[x>0] is always 0, so only the x==0 entries need updating
  zhat[x==0] <- phat/(phat + (1-phat)*exp(-lhat))
  # M-step: in effect, removing the expected number of zeroes due to z=1
  lhat <- sum.x/sum(1-zhat)
  phat <- mean(zhat)
  cat("Iteration: ",i, " lhat: ",lhat, " phat: ", phat,"\n")
}
Iteration: 1 lhat: 1.443948 phat: 0.3792712
Iteration: 2 lhat: 1.300164 phat: 0.3106252
Iteration: 3 lhat: 1.225007 phat: 0.268331
...
Iteration: 99 lhat: 0.9883329 phat: 0.09311933
Iteration: 100 lhat: 0.9883194 phat: 0.09310694
In your case, at each step you'll do a weighted Poisson regression where the weights are 1-zhat
to get the estimates of $\beta$ and therefore $\lambda_i$, and then maximize:
$\sum (\mathbb{E}z_i\log{p_i} + (1-\mathbb{E}z_i)\log{(1-p_i)})$
with respect to the coefficient vector of your matrix $\mathbf{G}$ to get the estimates of $p_i$. The expected values $\mathbb{E}z_i = p_i/(p_i+(1-p_i)\exp{(-\lambda_i)})$ for $x_i=0$ (and $\mathbb{E}z_i=0$ for $x_i>0$) are recalculated at each iteration.
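Here is a minimal sketch of that regression version on simulated data. The covariates `w` and `g` and all the "true" coefficient values are illustrative, not from your problem:

```r
# EM for a zero-inflated Poisson regression -- a sketch on simulated data.
set.seed(1)
n <- 5000
w <- rnorm(n)                          # covariate driving lambda_i
g <- rnorm(n)                          # covariate driving p_i
lambda <- exp(0.5 + 0.7*w)
p <- plogis(-1 + 1.0*g)
z <- rbinom(n, 1, p)                   # latent structural-zero indicator
x <- ifelse(z == 1, 0, rpois(n, lambda))

beta <- c(0, 0); gamma <- c(0, 0)      # (poor) starting values
for (it in 1:50) {
  # E-step: expected z_i given the current parameters (zero when x_i > 0)
  lam.hat <- exp(beta[1] + beta[2]*w)
  p.hat   <- plogis(gamma[1] + gamma[2]*g)
  zhat <- ifelse(x == 0, p.hat/(p.hat + (1 - p.hat)*exp(-lam.hat)), 0)
  # M-step for beta: weighted Poisson regression with weights 1 - zhat
  beta <- coef(glm(x ~ w, family = poisson, weights = 1 - zhat))
  # M-step for gamma: logistic regression with E(z_i) as the response
  # (this maximizes the sum displayed above; the non-integer-successes
  # warning is expected here, hence suppressWarnings)
  gamma <- coef(suppressWarnings(glm(zhat ~ g, family = binomial)))
}
beta; gamma    # roughly (0.5, 0.7) and (-1, 1), up to sampling error
```

The two M-steps are exactly the weighted Poisson regression and the maximization of $\sum (\mathbb{E}z_i\log{p_i} + (1-\mathbb{E}z_i)\log{(1-p_i)})$ described above; the latter is just a binomial GLM with the fractional $\mathbb{E}z_i$ as response.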
If you want to do this for real data, as opposed to just understanding the algorithm, R packages already exist; here's an example http://www.ats.ucla.edu/stat/r/dae/zipoisson.htm using the pscl
library.
EDIT: I should emphasize that what we are doing is maximizing the expected value of the complete-data log likelihood, NOT maximizing the complete-data log likelihood with the expected values of the missing data/latent variables plugged in. As it happens, if the complete-data log likelihood is linear in the missing data, as it is here, the two approaches are the same, but otherwise, they aren't.
Throughout my answer, the usual conditional mean independence
$\mathbb{E}(\varepsilon_{i}\vert X_{i},Z_{i})=0$ is maintained.
It is instructive to consider a concrete
example. Let $X_{i}$ be a dummy for college education, such that $X_{i}=1$ if worker $i$ is a college graduate, and $X_{i}=0$ otherwise; and let $Z_{i}$ be a dummy for gender, such that $Z_{i}=1$ if $i$ is male, and $0$
if $i$ is female. And suppose $Y_{i}$ is the observed income. Hence
$\mathbb{E}(Y_{i}\vert X_{i}=1,Z_{i}=1)$ is the expected income of
a male college graduate, and $\mathbb{E}(Y_{i}\vert X_{i}=1,Z_{i}=0)$
is the expected income of a female college graduate. Other conditional expectations, such as $\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=0)$, have similar interpretations.
First, it is not hard to verify that the coefficient $\alpha_{2}$ equals
$$
\alpha_{2}=\mathbb{E}(Y_{i}\vert X_{i}=1,Z_{i}=1)-\mathbb{E}(Y_{i}\vert X_{i}=1,Z_{i}=0).
$$
This is the difference in expected income between male and female college graduates. A significant $\alpha_{2}$ may indicate gender discrimination among college graduates.
Next, we have
$$
\beta_{2}+\beta_{3}=\alpha_{2}=\mathbb{E}(Y_{i}\vert X_{i}=1,Z_{i}=1)-\mathbb{E}(Y_{i}\vert X_{i}=1,Z_{i}=0).
$$
And
$$
\beta_{0}+\beta_{2}=\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=1),\ \beta_{0}=\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=0).
$$
So
$$
\beta_{2}=\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=1)-\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=0),
$$
which measures the gender discrimination among workers without college
degrees. And $\beta_{3}=(\beta_{2}+\beta_{3})-\beta_{2}$, that is
$$
\beta_{3}=\{\mathbb{E}(Y_{i}\vert X_{i}=1,Z_{i}=1)-\mathbb{E}(Y_{i}\vert X_{i}=1,Z_{i}=0)\}-\{\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=1)-\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=0)\}.
$$
So $\beta_{3}$ can be understood as the difference in the magnitude of gender discrimination between two cohorts: workers with a college education and workers without one. A positive $\beta_{3}$ indicates that the gender discrimination among more educated workers is greater than that among less educated workers.
Last but not least, one important assumption made implicitly
by model (1) is the following
$$
\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=0)=\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=1)=\mathbb{E}(Y_{i}\vert X_{i}=0)=\alpha_{0}.
$$
That is, by specifying model (1), one has assumed that there is no wage discrimination by gender among those without a college degree. The expectations $\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=0)$ and $\mathbb{E}(Y_{i}\vert X_{i}=0,Z_{i}=1)$ are the expected incomes of female and male workers without college education, respectively. Such an assumption may or may not hold in general, depending on your empirical exercise.
Best Answer
The main selling point for the latent variable representation of logistic regression is its link to a theory of (rational) choice. Sometimes that is extremely useful, but sometimes it makes no sense (and often we are somewhere in between). If we study whether a particular drug increases one's chance of getting better, then it makes little sense to assume that patients choose between remaining ill and getting better, so in that case I would use the representation in terms of log-odds. If we start with a rational choice theory of why people do something, and want to test that theory, then the latent variable representation would often make sense.
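For intuition, the latent variable representation can be sketched as follows: a latent "utility" with logistic errors, observed only through its sign, generates exactly an ordinary logistic regression. The coefficient values here are arbitrary:

```r
# Latent variable representation of logistic regression -- a sketch.
# y* = b0 + b1*x + e with logistic errors; we observe only y = 1(y* > 0).
# Fitting an ordinary logit to y recovers (b0, b1).
set.seed(3)
n <- 100000
x <- rnorm(n)
ystar <- -1 + 2*x + rlogis(n)          # latent "utility"
y <- as.integer(ystar > 0)
coef(glm(y ~ x, family = binomial))    # roughly (-1, 2)
```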