Logistic regression and latent data

latent-variable, logistic, regression

Assume a simple logistic regression model: we are given binary data $y_1,\ldots,y_N$, where for each $1 \leq i \leq N$ the outcome $y_i$ depends on one variable $x_i$. The success probability $p_i = \mathbb{P}(y_i = 1|x_i)$ is then modeled as a function of $x_i$ by the following relation
$$\ln\left(\frac{p_i}{1-p_i}\right) = \beta_0+\beta_1 x_i $$
In some cases, latent variables $Z_i$ are used instead, by defining $Z_i \geq 0 \Leftrightarrow y_i = 1$ and $Z_i <0 \Leftrightarrow y_i = 0$, and the regression model is then defined as
$$Z_i = \beta_0+\beta_1 x_i + \epsilon_i$$
Is there any particular reason why the latent variable approach is more useful? Furthermore, when using the original logistic model above we can plot $p_i$ as a function of $x_i$. How does that work for the latent variable approach? I don't fully understand the main idea behind this approach.

Best Answer

The main selling point for the latent variable representation of logistic regression is its link to a theory of (rational) choice. Sometimes that is extremely useful, sometimes it makes no sense, and often we are somewhere in between. If we study whether a particular drug increases one's chance of getting better, then it makes little sense to assume that the patients choose between remaining ill and getting better. So in that case I would use the representation in terms of log-odds. If we start with a rational choice theory of why people do something, and want to test that theory, then the latent variable representation would often make sense.
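As for plotting $p_i$ against $x_i$: the two representations give the same curve, provided we assume that $\epsilon_i$ follows a standard logistic distribution (the question does not state the distribution, so this is an assumption). Under that assumption, and using the symmetry of the logistic distribution, $p_i = \mathbb{P}(Z_i \geq 0|x_i) = \mathbb{P}(\epsilon_i \geq -(\beta_0+\beta_1 x_i)) = \frac{1}{1+e^{-(\beta_0+\beta_1 x_i)}}$, which is the log-odds model solved for $p_i$. The sketch below checks this by simulation. The coefficient values and the function names are made up for illustration only.

```python
import math
import random

# Hypothetical coefficients, chosen only for illustration.
BETA0, BETA1 = -0.5, 1.2


def p_logistic(x):
    """Success probability from the log-odds representation."""
    return 1.0 / (1.0 + math.exp(-(BETA0 + BETA1 * x)))


def draw_standard_logistic(rng):
    """Draw epsilon from the standard logistic distribution by inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return math.log(u / (1.0 - u))


def p_latent(x, n_draws, rng):
    """Share of draws with Z = beta0 + beta1*x + epsilon >= 0, i.e. y = 1."""
    hits = 0
    for _ in range(n_draws):
        z = BETA0 + BETA1 * x + draw_standard_logistic(rng)
        hits += z >= 0
    return hits / n_draws


rng = random.Random(0)
x_grid = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

# For each x: (x, simulated share of y = 1, closed-form probability).
results = [(x, p_latent(x, 50_000, rng), p_logistic(x)) for x in x_grid]

for x, simulated, formula in results:
    print(f"x={x:+.1f}  simulated={simulated:.3f}  formula={formula:.3f}")
```

The simulated shares should agree with the closed-form probabilities up to Monte Carlo noise, so a plot of $p_i$ against $x_i$ looks the same under either representation. If $\epsilon_i$ were assumed standard normal instead, the same construction would give a probit model rather than a logistic one.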