Solved – Understanding lack of fit in logistic regression

goodness of fit, logistic, logit, modeling, regression

How does one interpret the fact that a dataset has a poor fit / lack of fit with respect to a logistic regression model? I can make sense, for example, of a lack of fit in the case of a linear regression: the data cannot be modeled linearly. But I can't make sense of a lack of fit for a logistic regression. Do we just mean that there is no S-curve that effectively models the probability distribution of the data (edit, per the top comment: the log odds of the data)?

Best Answer

In logistic regression, you are modeling the probabilities of 'success' (i.e., $P(Y_i=1)$). Thus, ultimately, lack of fit just means that the model's predicted probabilities do not track the true probabilities (although, of course, we don't really know the true probabilities).

Now, the model will fit the observed proportions in the data (that's how the coefficients are estimated), so you might not think this could be a problem. However, models usually impose constraints relative to the data. That isn't always so: consider a one-way ANOVA-ish logistic regression that compares the probability of success across three nominal categories. In that case there can be no lack of fit; the model's predicted probabilities will exactly equal the observed proportions in the three conditions.

But imagine a slightly more complicated two-way ANOVA-ish logistic regression in which those three conditions are crossed with a second, dichotomous factor. If a model with two factors but no interaction is fit, the coefficients are constrained such that the predicted probabilities for $Aa$ and $Ab$, $Ba$ and $Bb$, and $Ca$ and $Cb$ must differ by a constant shift (on the log odds scale). That may not be correct: an interaction term may be needed. If an interaction term is included, no lack of fit is possible (though the term may not be necessary), but when the interaction is omitted, lack of fit can occur. You can see an example of this in my answer here: Test logistic regression model using residual deviance and degrees of freedom.
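To make that two-way scenario concrete, here is a minimal sketch, assuming Python with numpy, pandas, scipy, and statsmodels; the cell log odds are invented for illustration. It fits the constrained (additive) model and the saturated (interaction) model, then tests the constraint with a likelihood-ratio test:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical 3x2 design whose true log odds contain an interaction,
# so the additive model's constant-shift constraint is wrong.
true_logit = {("A", "a"): -1.0, ("A", "b"):  0.5,
              ("B", "a"):  0.0, ("B", "b"):  0.0,
              ("C", "a"):  1.0, ("C", "b"): -0.5}
rows = []
for (f1, f2), lo in true_logit.items():
    p = 1.0 / (1.0 + np.exp(-lo))
    rows.append(pd.DataFrame({"f1": f1, "f2": f2,
                              "y": rng.binomial(1, p, size=300)}))
df = pd.concat(rows, ignore_index=True)

# Additive model: forces a constant log-odds shift between 'a' and 'b'
# within every level of f1.  The interaction model relaxes that.
additive  = smf.glm("y ~ f1 + f2", data=df,
                    family=sm.families.Binomial()).fit()
saturated = smf.glm("y ~ f1 * f2", data=df,
                    family=sm.families.Binomial()).fit()

# Likelihood-ratio test of the interaction = a test of this lack of fit.
lr = additive.deviance - saturated.deviance
df_diff = additive.df_resid - saturated.df_resid
print(f"LR stat = {lr:.2f} on {df_diff:.0f} df, "
      f"p = {stats.chi2.sf(lr, df_diff):.4f}")
```

Because the interaction model is saturated for this design, its fitted probabilities equal the six observed cell proportions, so no lack of fit is possible there.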

Nominal covariates constitute the simplest case, but other possibilities exist. When the covariates are continuous, the functional relationship can differ from the one the model specifies. This can happen in several ways:

  1. One might be that the true probabilities have a natural 'floor' and/or 'ceiling'. Imagine modeling the probability that students get a $4$-option multiple-choice question correct. When students know nothing, we expect the probability to drop to $.25$ (pure guessing), not $0$. But a simple logistic regression model must yield predicted probabilities that asymptote to $0$ as values of $X$ become ever more extreme in one direction. (A small simulation of this appears after the list.)

  2. Another is that the relationship between the covariate and the predicted probabilities is not linear on the log odds scale. You can see an example of this with (presumably) real data in my answer here: How to use boxplots to find the point where values are more likely to come from different conditions?

  3. A final possibility is that the relationship is linear, but on a different scale than the log odds (which is what logistic regression models). That is, the link function is misspecified. Note that this is a subtler form of the issue in #2 above. You can see how different link functions pick out different relationships between $X$ and the predicted probabilities in the figure in my answer here: Difference between logit and probit models. Because many link functions tend to be similar, this last possibility can be difficult to detect. To quote from that answer:

    [T]he empirical fit of the model to the data is unlikely to be of assistance in selecting a link, unless the shapes of the link functions in question differ substantially (of which, the logit and probit do not).
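To make possibility #1 concrete, here is a minimal simulation, again assuming Python with statsmodels; the $.25$ guessing floor and the 'ability' covariate are invented for illustration. An ordinary logistic fit has no floor, so its predictions get pushed below $.25$ for the weakest students:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical 4-option multiple-choice item: the probability of a
# correct answer has a guessing floor of 0.25, rising with 'ability' x.
x = rng.normal(size=2000)
p_true = 0.25 + 0.75 / (1.0 + np.exp(-2.0 * x))
y = rng.binomial(1, p_true)

# A plain logistic regression ignores the floor: its predicted
# probabilities must asymptote to 0 as x decreases.
X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
p_hat = fit.predict(X)

# Compare fitted vs. true probabilities among the weakest 10% of students.
low = x < np.quantile(x, 0.10)
print(f"true prob (low x):   {p_true[low].mean():.3f}")  # near the 0.25 floor
print(f"fitted prob (low x): {p_hat[low].mean():.3f}")   # typically below it
```

This mismatch in the tail is the kind of lack of fit that calibration plots or deviance-based tests are meant to flag.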
