Logistic Regression – Interpretation of Logit vs. Independent Variable Plot

data visualization, generalized linear model, logistic, multiple regression, r

I plotted the predicted log odds of my outcome variable against each of my predictor variables, hwt and ist. There is a hard vertical line in my hwt plot and a hard diagonal line in my ist plot. I have two questions: (1) Why do I have hard lines in my plots (I don't understand what would cause this), and (2) does the ist plot satisfy the linearity assumption for logistic regression? It looks "fairly" linear to me. My dataset consists of 300 observations, of which 269 have an outcome of "0" and 31 have an outcome of "1". The ist predictor is a percentage, which ranges from 0 to 100%. The hwt variable is measured in m/ha. This is the code I used to generate the plots:


bestmodel <- glm(outcome ~ hwt + ist, data = habitatdata, family = "binomial")

library(ggplot2)
probabilities <- predict(bestmodel, type = "response")
logit <- log(probabilities / (1 - probabilities))

ggplot(habitatdata, aes(ist, logit)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = loess) +
  theme_bw()

ggplot(habitatdata, aes(hwt, logit)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = loess) +
  theme_bw()

[Plot 1: predicted logit vs. ist, with a diagonal line of points]

[Plot 2: predicted logit vs. hwt, with a vertical line of points]

Best Answer

The diagonal and vertical lines of points are caused by the same individuals: those with a value of 0 for hwt.

Let's consider the second plot first. That line of points is concentrated at hwt = 0, but there is some variability in the predicted logit. That is because those individuals have different values of ist, which produces different values of the logit given the estimated regression coefficients. There is nothing strange about this.

The first plot may look strange, but it's actually a result of the same phenomenon. Consider all those units with zeroes for hwt and think about what the regression equation looks like for them. Because hwt = 0, all that is left is a linear relationship between ist and the logit, and that is reflected in that diagonal line. (Note that this kind of thing would happen if you had a clustering of observations at any single value of hwt, even if that value was not zero; the constant hwt term would simply shift the intercept of the line.) More specifically, if the estimated regression equation looked like $\text{logit}(p) = b_0 + b_1 (\text{ist}) + b_2 (\text{hwt})$, see what the equation looks like for those with hwt = 0: all we have is $\text{logit}(p) = b_0 + b_1 (\text{ist})$, which is a perfectly straight diagonal line.
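As a rough numerical check of this argument, here is a minimal sketch on simulated data. The data frame `sim`, its coefficients, and the share of zeros are all made up for illustration and are not the original habitatdata; only the model formula matches the question.

```r
# Simulated stand-in for habitatdata (hypothetical values, not the original data)
set.seed(1)
n   <- 300
ist <- runif(n, 0, 100)                                   # percentage, 0-100
hwt <- ifelse(runif(n) < 0.4, 0, rexp(n, rate = 1 / 20))  # many exact zeros
outcome <- rbinom(n, 1, plogis(-3 + 0.02 * ist - 0.03 * hwt))
sim <- data.frame(outcome, hwt, ist)

fit   <- glm(outcome ~ hwt + ist, data = sim, family = "binomial")
logit <- predict(fit, type = "link")   # linear predictor, i.e. the predicted log odds

# For rows with hwt == 0 the hwt term vanishes, so logit = b0 + b1 * ist
zero    <- sim$hwt == 0
b       <- coef(fit)
max_gap <- max(abs(logit[zero] - (b["(Intercept)"] + b["ist"] * sim$ist[zero])))
max_gap   # zero up to floating-point error
```

Note that `predict(fit, type = "link")` returns the logit directly, so the manual `log(p / (1 - p))` step is not needed.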

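To verify this, add aes(color = hwt == 0) to your geom_point() call in both plots. (Use color rather than fill here: fill has no visible effect on the default point shape.) You will see that the points on the diagonal line in the first plot correspond exactly to the points on the vertical line in the second plot, being those with hwt = 0.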
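A sketch of what that check could look like is below. It uses simulated data so that it runs on its own; `sim` and its values are hypothetical stand-ins for habitatdata, and the colour is mapped in the top-level aes(), which has the same effect as putting it inside geom_point().

```r
library(ggplot2)

# Simulated stand-in for habitatdata (hypothetical values, not the original data)
set.seed(1)
n   <- 300
sim <- data.frame(ist = runif(n, 0, 100),
                  hwt = ifelse(runif(n) < 0.4, 0, rexp(n, rate = 1 / 20)))
sim$outcome <- rbinom(n, 1, plogis(-3 + 0.02 * sim$ist - 0.03 * sim$hwt))

fit <- glm(outcome ~ hwt + ist, data = sim, family = "binomial")
sim$logit <- predict(fit, type = "link")   # predicted log odds

# Colour points by whether hwt is exactly zero; the same rows are flagged in both plots
p_ist <- ggplot(sim, aes(ist, logit, colour = hwt == 0)) +
  geom_point(size = 0.5, alpha = 0.5) +
  theme_bw()
p_hwt <- ggplot(sim, aes(hwt, logit, colour = hwt == 0)) +
  geom_point(size = 0.5, alpha = 0.5) +
  theme_bw()

p_ist   # flagged points fall on the diagonal line
p_hwt   # the same flagged points fall on the vertical line at hwt = 0
```

With the original objects, the equivalent would be to replace `sim` with habitatdata and use the logit vector already computed in the question.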
