Logistic – How to Understand Link Function in Generalized Linear Model

generalized-linear-model, linear, link-function, logistic

I am still trying to learn (maybe it is a terminology issue) what "link function" means. For example, in logistic regression, we assume the response variable comes from a binomial distribution.

The inverse link function $\text{logit}^{-1}$ converts a real number in $(-\infty, +\infty)$ (the output of $\beta^{\top}x$) into a probability in $(0,1)$. But how does it "link" to the binomial distribution, which is a discrete distribution?

I understand that the "link" is between a real number and a probability, but there seems to be a missing step from that probability to the binomial distribution.
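To make the first half of the "link" concrete, here is a minimal sketch of the inverse logit (the function name `inv_logit` is just an illustrative choice, not a standard library call):

```python
import math

def inv_logit(z):
    """Map a real number z = beta'x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Any real number becomes a valid success probability:
print(inv_logit(-5.0))  # ~0.0067
print(inv_logit(0.0))   # 0.5
print(inv_logit(3.0))   # ~0.9526
```

The missing step the question asks about is then: this probability is used as the success parameter of the Bernoulli/binomial distribution that the discrete response is drawn from.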

Am I right?

Best Answer

So when you have binary response data, you have a "yes/no" or "1/0" outcome for each observation. However, what you are trying to estimate when doing a binary response regression is not a 1/0 outcome for each set of values of the independent variables, but the probability that an individual with those characteristics will respond "yes". Then the response is not discrete anymore; it is continuous (in the $(0,1)$ interval). The responses in the data (the true $y_i$) are indeed binary, but the estimated responses (the $\Lambda(x_i'b)$ or $\Phi(x_i'b)$) are probabilities.
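A small simulation makes this distinction visible. The coefficients and sample size below are arbitrary choices for illustration: the observed $y_i$ are strictly 0/1, while the modelled responses $\Lambda(x_i'\beta)$ lie strictly inside $(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: binary outcomes whose success probability depends on one covariate.
n = 1000
x = rng.normal(size=n)
beta0, beta1 = -0.5, 2.0                        # illustrative "true" coefficients
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))  # Lambda(x_i'beta), continuous
y = rng.binomial(1, p)                          # observed responses, discrete

print(sorted(set(y.tolist())))          # the data: only 0s and 1s
print(float(p.min()), float(p.max()))   # the modelled response: inside (0, 1)
```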

The underlying meaning of these link functions is that they correspond to the distribution we impose on the error term in the latent variable model. Imagine each individual has an underlying (unobservable) willingness to say "yes" (or be a 1) in the outcome. Then we model this willingness as $y_i^*$ using a linear regression on the individual's characteristics $x_i$ (which is a vector in multiple regression):

$$y_i^*=x_i'\beta + \epsilon_i.$$

This is what is called a latent variable regression. If this individual's willingness was positive ($y_i^*>0$), the individual's observed outcome would be a "yes" ($y_i=1$), otherwise a "no". Note that the choice of threshold doesn't matter as the latent variable model has an intercept.
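The latent-variable story above can be sketched as a simulation (single covariate, with `beta` chosen only for illustration). If the model is right, then among individuals with $x_i \approx 1$ the share of observed 1s should approximate $\Lambda(1 \cdot \beta)$:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100_000
x = rng.normal(size=n)
beta = 1.5

eps = rng.logistic(size=n)       # logistic errors -> logit model
y_star = x * beta + eps          # latent "willingness" (never observed)
y = (y_star > 0).astype(int)     # observed outcome: 1 iff willingness > 0

# Among individuals with x near 1, the share of 1s should approximate
# Lambda(1.5) = 1 / (1 + exp(-1.5)) ~ 0.818
near_one = np.abs(x - 1.0) < 0.05
print(y[near_one].mean())
```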

In linear regression we assume the error term to be normally distributed. In binary response and other models, we likewise need to impose/assume a distribution on the error terms. The (inverse) link function is the cumulative distribution function that the error terms follow. For instance, if it is logistic (we use the symmetry of the logistic distribution in the fourth equality below),

$$P(y_i=1)=P(y_i^*>0)=P(x_i'\beta + \epsilon_i>0)=P(\epsilon_i>-x_i'\beta)=P(\epsilon_i<x_i'\beta)=\Lambda(x_i'\beta).$$

If you assumed the errors to be normally distributed, then you would have a probit link, $\Phi(\cdot)$, instead of $\Lambda(\cdot)$.
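This correspondence between the error distribution and the link can be checked numerically: with logistic errors, the empirical frequency of $\epsilon_i < x_i'\beta$ approaches $\Lambda(x_i'\beta)$, and with normal errors it approaches $\Phi(x_i'\beta)$ (here $\Phi$ is computed via `math.erf` to avoid extra dependencies; the value `xb = 0.7` is an arbitrary illustration):

```python
import math
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
xb = 0.7                                  # one fixed value of x_i'beta

# Logistic errors: P(eps < xb) should match Lambda(xb) = 1/(1 + exp(-xb))
eps_logistic = rng.logistic(size=n)
print(np.mean(eps_logistic < xb), 1.0 / (1.0 + math.exp(-xb)))

# Normal errors: P(eps < xb) should match Phi(xb), the probit link
eps_normal = rng.normal(size=n)
phi = 0.5 * (1.0 + math.erf(xb / math.sqrt(2.0)))
print(np.mean(eps_normal < xb), phi)
```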
