We can rationalize this as follows:
Underlying logistic regression is a latent (unobservable) linear regression model:
$$y^* = X\beta + u$$
where $y^*$ is a continuous unobservable variable (and $X$ is the regressor matrix). The error term is assumed, conditional on the regressors, to follow the logistic distribution, $u\mid X\sim \Lambda(0, \frac {\pi^2}{3})$.
We assume that what we observe, i.e. the binary variable $y$, is an Indicator function of the unobservable $y^*$:
$$ y = 1 \;\;\text{if} \;\;y^*>0,\qquad y = 0 \;\;\text{if}\;\; y^*\le 0$$
Then we ask "what is the probability that $y$ will take the value $1$ given the regressors (i.e. we are looking at a conditional probability). This is
$$P(y =1\mid X ) = P(y^*>0\mid X) = P(X\beta + u>0\mid X) = P(u> - X\beta\mid X) \\= 1- \Lambda (-Χ\beta) = \Lambda (X\beta) $$
the last equality due to the symmetry property of the logistic cumulative distribution function.
So we have obtained the basic logistic regression model
$$p=P(y =1 \mid X) = \Lambda (X\beta) = \frac 1 {1+e^{-X\beta}}$$
After that, the other answers give you how we manipulate this expression algebraically to arrive at
$$\log \frac {p}{1 - p} = X\beta $$
It is therefore the initial linear assumption/specification related to the Latent variable $y^*$, that leads to this last relation to hold.
Note that $\log \frac {p}{1 - p}$ is not equal to the latent variable $y^*$ but rather $y^* = \log \frac {p}{1 - p} + u$
Cons of an identity link in the case of the Poisson regression are:
- As you have mentioned, it can produce out-of-range predictions.
- You may get weird errors and warnings when attempting to fit the model, because the link permits lambda to be less than 0, but the Poisson distribution is not defined for such values.
- As Poisson regression assumes that the mean and variance are the same, when you change the link you are also changing assumptions about the variance. My experience has been that this last point is most telling.
But, ultimately this is an empirical question. Fit both models. Perform whatever checks you like. If the identity link has a lower AIC, and does as well or better on all your other checks, then run with the identity link.
In the case of the logit model vs the linear probability model (i.e., what you refer to as the identity link), the situation is a lot more straightforward. Except for some very exotic cases in econometrics (which you will find if you do a search), the logit model is better: it makes fewer assumptions and is what most people use. Using the linear probability model in its place would verge on being perverse.
As regards interpreting the models, if you are using R, there are two great packages that will do all the heavy lifting: effects, which is super easy to use, and zelig, which is harder to use but great if you want to make predictions.
Best Answer
So when you have binary response data, you have a "yes/no" or "1/0" outcome for each observation. However, what you are trying to estimate when doing a binary response regression is not a 1/0 outcome for each set of values of the independent variables you impose, but the probability that an individual with such characteristics will result in a "yes" outcome. Then the response is not discrete anymore, it's continuous (in the (0,1) interval). The response in the data (the true $y_i$) are, indeed, binary, but the estimated response (the $\Lambda(x_i'b)$ or $\Phi(x_i'b)$) are probabilities.
The underlying meaning of these link functions is that they are the distribution we impose to the error term in the latent variable model. Imagine each individual has an underlying (unobservable) willingness to say "yes" (or be a 1) in the outcome. Then we model this willingness as $y_i^*$ using a linear regression on the individual's characteristics $x_i$ (which is a vector in multiple regression):
$$y_i^*=x_i'\beta + \epsilon_i.$$
This is what is called a latent variable regression. If this individual's willingness was positive ($y_i^*>0$), the individual's observed outcome would be a "yes" ($y_i=1$), otherwise a "no". Note that the choice of threshold doesn't matter as the latent variable model has an intercept.
In linear regression we assume the error term to be normally distributed. In binary response and other models, we need to impose/assume a distribution on the error terms. The link function is the cumulative probability function that the error terms follow. For instance, if it is logistic (and we will use that the logistic distribution is symmetric in the fourth equality),
$$P(y_i=1)=P(y_i^*>0)=P(x_i'\beta + \epsilon_i>0)=P(\epsilon_i>-x_i'\beta)=P(\epsilon_i<x_i'\beta)=\Lambda(x_i'\beta).$$
If you assumed the errors to be normally distributed, then you would have a probit link, $\Phi(\cdot)$, instead of $\Lambda(\cdot)$.