Solved – Should predicted probabilities from Logistic Regression correspond with percentages

data visualization, logistic, regression

I have conducted a logistic regression to identify whether student status (student/non-student), time period (time 1, 2, or 3), or condition (condition 1 or condition 2) predicts a binary outcome (buying lunch or not buying lunch).

I have plotted the predicted probabilities saved from the logistic regression to visualise the data. These show a decrease in the probability of lunch being bought at time 2 for one of the conditions. However, when looking at the percentages of people who bought their lunch (rather than the predicted probabilities), there is an increase at time 2.

Is it possible for there to be an increase in terms of percentages but a decrease in terms of probabilities, or does this indicate that something has gone wrong with the model?

Best Answer

It is certainly the ideal for a logistic regression model's predicted probabilities to match the observed proportions, but a given model may not match them exactly and still be just fine, while a poorly fitting model will not match them at all. My guess is that your model doesn't fit well.

If your data constitute a single categorical variable, the logistic regression model will yield predicted probabilities that exactly mirror the observed proportions, because a parameter is fitted for every level of the variable. However, if you have multiple categorical variables, that won't necessarily be the case. If you are only fitting main effects, then the model (or you) is assuming that the effect in each combination of levels is just the sum of the effects of the constituent variables (on the log-odds scale).
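To make the single-variable case concrete, here is a minimal sketch in Python with statsmodels. The variable names (student, bought_lunch) and all numbers are made up for illustration, not taken from your study: with one categorical predictor, the fitted probabilities reproduce the observed group proportions exactly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"student": rng.choice(["student", "non-student"], size=300)})

# Simulate a binary outcome whose true probability depends only on the group
p = np.where(df["student"] == "student", 0.7, 0.4)
df["bought_lunch"] = rng.binomial(1, p)

# Logistic regression with a single categorical predictor
fit = smf.logit("bought_lunch ~ C(student)", data=df).fit(disp=0)
df["predicted"] = fit.predict(df)

# Within each group, the predicted probability equals the observed proportion
print(df.groupby("student")[["bought_lunch", "predicted"]].mean())
```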

The first possibility is that that assumption is true. Nonetheless, because there is sampling variability (your sample will not be a perfect reflection of the population), the predicted probabilities will not exactly match the observed proportions, but that's OK.

The second possibility is that the effects are not additive. That is, there is an interaction among the variables. (From your description, I'm guessing this is what is happening in your case.) If so, you just need to add the appropriate interaction terms. If you add all possible interaction terms (two-way and higher-order), you get back to a model that contains a parameter for every possible combination of variable levels; the model will then fit the data perfectly, although that may be overfitting. Between those extremes, it is possible that some intermediate number of interaction terms is appropriate, in which case the predicted probabilities will typically differ somewhat (but by an appropriate amount) from the observed proportions. To see these points in action, it may help to read my answer here: Test logistic regression model using residual deviance and degrees of freedom.
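Here is a hedged sketch of that contrast with simulated data. The variable names (student, time, condition, bought_lunch) and the interaction pattern are invented to mimic your description: a main-effects-only fit generally misses the observed cell proportions, while the fully saturated model (all interactions) reproduces them exactly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1200
df = pd.DataFrame({
    "student":   rng.choice(["student", "non-student"], size=n),
    "time":      rng.choice(["t1", "t2", "t3"], size=n),
    "condition": rng.choice(["c1", "c2"], size=n),
})

# True probabilities contain an interaction: condition c2 dips at time t2
base = 0.5 + 0.1 * (df["student"] == "student")
dip = 0.25 * ((df["condition"] == "c2") & (df["time"] == "t2"))
df["bought_lunch"] = rng.binomial(1, np.clip(base - dip, 0.05, 0.95))

# Main effects only vs. fully saturated (all interactions)
main = smf.logit("bought_lunch ~ C(student) + C(time) + C(condition)",
                 data=df).fit(disp=0)
satur = smf.logit("bought_lunch ~ C(student) * C(time) * C(condition)",
                  data=df).fit(disp=0)

# Compare observed cell proportions with each model's predicted probabilities
cells = (df.groupby(["student", "time", "condition"])
           .agg(observed=("bought_lunch", "mean"))
           .reset_index())
cells["main_effects"] = main.predict(cells)
cells["saturated"] = satur.predict(cells)
print(cells)  # 'saturated' matches 'observed'; 'main_effects' usually does not
```

Comparing the predicted and observed columns cell by cell is essentially what the residual-deviance check in the linked answer formalizes.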


Above, I discuss categorical variables, because I understand that to be your situation, but we can extend this to continuous variables. With observational data, each x-value is typically unique, so it is difficult to compute observed proportions and different techniques need to be used. In an experimental setup, data might be replicated at prespecified values, in which case you could code them as levels of a categorical variable (giving you a perfect fit, analogous to the discussion above). For the sake of this discussion, let us assume that the data are replicated at specific values, but are not being coded as levels of a categorical variable.

Now, even if the model is exactly correct, you should expect the observed proportions to bounce around the predicted probabilities rather than all being exactly equal to them. (They should bounce around by a moderate and appropriate amount.) However, it is again possible that they do not match well and that something is clearly wrong. There are a couple of ways this could happen. One is that you have used the wrong link function and the correct link has a distinctly different shape (cf. my answer here: Difference between logit and probit models). Another possibility is that the function mapping the x-variable to the response is not linear in the transformed space (cf. my answer here: How to use boxplots to find the point where values are more likely to come from different conditions?).
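As a rough illustration of the wrong-link case, the sketch below simulates data at replicated x-values from a complementary log-log model and then fits GLMs with both a logit link and a cloglog link (all values are made up): the mismatched link departs systematically from the observed proportions, while the correct one tracks them closely.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
x_levels = np.linspace(-2, 2, 9)               # prespecified, replicated x-values
x = np.repeat(x_levels, 200)
true_p = 1 - np.exp(-np.exp(-0.5 + 1.5 * x))   # complementary log-log truth
y = rng.binomial(1, true_p)

X = sm.add_constant(x)
logit_fit = sm.GLM(y, X, family=sm.families.Binomial(
    link=sm.families.links.Logit())).fit()
cloglog_fit = sm.GLM(y, X, family=sm.families.Binomial(
    link=sm.families.links.CLogLog())).fit()

# Observed proportion at each replicated x-value vs. each model's prediction
X_levels = sm.add_constant(x_levels)
summary = pd.DataFrame({
    "x": x_levels,
    "observed": [y[x == v].mean() for v in x_levels],
    "logit": logit_fit.predict(X_levels),
    "cloglog": cloglog_fit.predict(X_levels),
})
print(summary)  # the cloglog column tracks 'observed' more closely than 'logit'
```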
