How does one interpret the fact that a dataset has a poor fit / lack of fit with respect to a logistic regression model? I can make sense, for example, of a lack of fit in the case of a linear regression: the data cannot be modeled linearly. But I can't make sense of a lack of fit for a logistic regression. Do we just mean that there is no S-curve that effectively models the probability distribution of the data (edit per the top comment: log odds of the data)?
Understanding lack of fit in logistic regression
Tags: goodness-of-fit, logistic, logit, modeling, regression
Related Solutions
We can rationalize this as follows:
Underlying logistic regression is a latent (unobservable) linear regression model:
$$y^* = X\beta + u$$
where $y^*$ is a continuous unobservable variable (and $X$ is the regressor matrix). The error term is assumed, conditional on the regressors, to follow the logistic distribution, $u\mid X\sim \Lambda(0, \frac {\pi^2}{3})$.
We assume that what we observe, i.e. the binary variable $y$, is an Indicator function of the unobservable $y^*$:
$$ y = 1 \;\;\text{if} \;\;y^*>0,\qquad y = 0 \;\;\text{if}\;\; y^*\le 0$$
Then we ask "what is the probability that $y$ will take the value $1$, given the regressors?" (i.e., we are looking at a conditional probability). This is
$$P(y =1\mid X ) = P(y^*>0\mid X) = P(X\beta + u>0\mid X) = P(u> - X\beta\mid X) \\= 1- \Lambda (-X\beta) = \Lambda (X\beta) $$
the last equality due to the symmetry property of the logistic cumulative distribution function.
So we have obtained the basic logistic regression model
$$p=P(y =1 \mid X) = \Lambda (X\beta) = \frac 1 {1+e^{-X\beta}}$$
After that, the other answers show how this expression is manipulated algebraically to arrive at $$\log \frac {p}{1 - p} = X\beta $$
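For completeness, that manipulation takes only a couple of steps. Since $1-p = e^{-X\beta}/(1+e^{-X\beta})$, we have

$$p = \frac{1}{1+e^{-X\beta}} \;\Rightarrow\; \frac{p}{1-p} = \frac{1/(1+e^{-X\beta})}{e^{-X\beta}/(1+e^{-X\beta})} = e^{X\beta} \;\Rightarrow\; \log\frac{p}{1-p} = X\beta.$$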
It is therefore the initial linear assumption/specification for the latent variable $y^*$ that makes this last relation hold.
Note that $\log \frac {p}{1 - p}$ is not equal to the latent variable $y^*$; rather, $y^* = \log \frac {p}{1 - p} + u$.
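This latent-variable story is easy to check by simulation. Here is a minimal sketch (the values of $\beta$, $x$, and the sample size are arbitrary illustrative choices): draw standard logistic errors, which have variance $\pi^2/3$ as assumed above, threshold the latent $y^*$ at zero, and compare the empirical success frequency to $\Lambda(x\beta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

beta = 1.5
x = 0.8            # a fixed regressor value (illustrative choice)
n = 200_000

# Standard logistic errors already have variance pi^2/3, as assumed above.
u = rng.logistic(loc=0.0, scale=1.0, size=n)

y_star = x * beta + u          # latent linear model
y = (y_star > 0).astype(int)   # observed indicator

lam = 1.0 / (1.0 + np.exp(-x * beta))  # Lambda(x * beta)
print(y.mean(), lam)  # empirical P(y=1|x) and the model probability agree closely
```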
@Laconic's answer is great and complete, in my opinion. Something I wanted to add is that the original coefficients describe a difference in the log odds for two units who differ by 1 in the predictor. E.g., for a coefficient on $X$ of 5, we can say that the difference in log odds between two units who differ on $X$ by 1 is 5. Mathematically,
$$\beta = \log(\text{odds}(p|X=x_0+1))-\log(\text{odds}(p|X=x_0)) $$
When you exponentiate $\beta$, you get
$$\exp(\beta) = \exp(\log(\text{odds}(p|X=x_0+1))-\log(\text{odds}(p|X=x_0))) \\ = \frac{\exp(\log(\text{odds}(p|X=x_0+1)))}{\exp(\log(\text{odds}(p|X=x_0)))} \\ = \frac{\text{odds}(p|X=x_0+1)}{\text{odds}(p|X=x_0)}$$
which is a ratio of odds, an odds ratio.
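A quick numerical check of this identity (the intercept of $-1$, the slope of $5$, and the point $x_0 = 0.3$ are all made-up values for illustration):

```python
import numpy as np

def prob(x, alpha=-1.0, beta=5.0):
    """P(y=1 | X=x) under a logistic model (alpha and beta are illustrative)."""
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

def odds(p):
    return p / (1.0 - p)

x0 = 0.3
odds_ratio = odds(prob(x0 + 1)) / odds(prob(x0))
print(odds_ratio, np.exp(5.0))  # both ~148.41: the odds ratio equals exp(beta)
```

The same ratio comes out regardless of which $x_0$ you pick, which is exactly what the algebra above says.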
Best Answer
In logistic regression, you are modeling the probabilities of 'success' (i.e., $P(Y_i=1)$). Thus, ultimately, a lack of fit just means that the model's predicted probabilities do not follow the true probabilities (although, of course, we don't really know the true probabilities).
Now the model will fit the observed proportions in the data (that's how the coefficients are estimated), so you wouldn't think this should be a problem. However, models usually have constraints relative to the data. That doesn't have to be the case: consider a one-way ANOVA-ish logistic regression that compares the probability of success associated with three nominal categories. In such a case there can be no lack of fit; the model's predicted probabilities will exactly equal the observed proportions in the three conditions. But imagine a slightly more complicated two-way ANOVA-ish logistic regression where those three conditions are crossed with a second, dichotomous factor. If a model with two factors, but no interaction, is fit, the coefficients are constrained such that the predicted probabilities for $Aa$ and $Ab$, $Ba$ and $Bb$, and $Ca$ and $Cb$ must be a constant shift (on the log odds scale). That may not be correct: an interaction term may be needed. If an interaction term is included in the model, no lack of fit is possible (although it may not be necessary), but when an interaction is not included, lack of fit could occur. You can see an example of this in my answer here: Test logistic regression model using residual deviance and degrees of freedom.
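The constraint the no-interaction model imposes can be seen with a few made-up numbers: if the predicted log odds are additive in the two factors, the $a$-versus-$b$ log-odds difference is forced to be identical in rows $A$, $B$, and $C$. A small sketch (all effect sizes below are invented for illustration):

```python
import numpy as np

logit = lambda p: np.log(p / (1 - p))
inv_logit = lambda z: 1 / (1 + np.exp(-z))

# Hypothetical predicted log odds from a no-interaction model:
# log-odds(cell) = mu + effect of factor 1 level (+ shift for level b of factor 2)
mu = 0.2
f1 = {"A": 0.0, "B": 0.7, "C": -0.4}
shift = 0.9  # effect of level b vs. level a of the second factor

diffs = []
for level, eff in f1.items():
    p_a = inv_logit(mu + eff)          # predicted probability in cell (level, a)
    p_b = inv_logit(mu + eff + shift)  # predicted probability in cell (level, b)
    diffs.append(logit(p_b) - logit(p_a))
    print(level, round(diffs[-1], 6))  # the same 0.9 shift in every row
```

If the true cell probabilities do not share a constant shift on the log-odds scale, no choice of coefficients in the additive model can match them, and lack of fit results.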
Nominal covariates constitute the simplest case, but other possibilities exist. When the covariates are continuous, the functional relationship can differ from the one specified by the model. There are various ways this can occur:
One might be that the true probabilities have a natural 'floor' and/or 'ceiling'. Imagine modeling the probability students get a $4$ option multiple-choice question correct. When students don't know at all, we expect the probability to drop to $.25$, not $0$. But a simple logistic regression model must yield predicted probabilities that asymptote to $0$ as values of $X$ become ever more extreme in one direction.
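A small sketch of the floor problem: compare a plain logistic curve with a guessing-adjusted curve of the form $.25 + .75\,\Lambda(X\beta)$ (the slope and the ability values are arbitrary illustrative choices):

```python
import numpy as np

lam = lambda z: 1 / (1 + np.exp(-z))

# Ability values from very low to very high (illustrative scale)
x = np.array([-8.0, 0.0, 8.0])
beta = 1.0

p_logistic = lam(beta * x)                # plain logistic: left asymptote at 0
p_guessing = 0.25 + 0.75 * lam(beta * x)  # 4-option guessing: floor at .25

print(p_logistic)  # approaches 0 for very low ability
print(p_guessing)  # approaches .25 for very low ability
```

A simple logistic regression fit to data generated by the guessing model will be systematically wrong at the low end, however the coefficients are chosen.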
Another is that the relationship between the covariate and the predicted probabilities is not linear on the log odds scale. You can see an example of this with (presumably) real data in my answer here: How to use boxplots to find the point where values are more likely to come from different conditions?
A final possibility is that the relationship is linear, but on a different scale than the log odds (which is what logistic regression models). That is, the link function is misspecified. Note that this is a subtle form of the previous issue. You can see how different link functions can pick out different relationships between $X$ and the predicted probabilities in the figure in my answer here: Difference between logit and probit models. Because many link functions tend to be similar, this last possibility can be difficult to detect, as that answer discusses.
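To see how similar the links can be: the logistic CDF is closely approximated by a rescaled normal CDF (the conventional rescaling factor of roughly $1.7$ is used below), which is why logit and probit fits are often nearly indistinguishable in practice:

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-6, 6, 1001)
logistic_cdf = 1 / (1 + np.exp(-z))
# Probit coefficients are roughly logit coefficients divided by ~1.7, so rescale:
probit_cdf = norm.cdf(z / 1.7)

# The largest gap between the two curves is about 0.01 everywhere.
print(np.max(np.abs(logistic_cdf - probit_cdf)))
```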