Solved – Variation explained in ordinal logistic regression models

Tags: logistic, ordered-logit, r, r-squared, regression

I have made these three ordinal logistic regression models:

library(MASS)  # polr() for ordinal (proportional odds) logistic regression

model1 <- polr(as.factor(carb) ~ mpg,  Hess = TRUE, data = mtcars)
model2 <- polr(as.factor(carb) ~ hp,   Hess = TRUE, data = mtcars)
model3 <- polr(as.factor(carb) ~ drat, Hess = TRUE, data = mtcars)

To figure out if the models are a good fit to the data, I calculated the proportion of variation explained like this:

model_null <- polr(as.factor(carb) ~ 1, Hess = TRUE, data = mtcars)

1 - (model1$deviance / model_null$deviance)
# 0.1512784
1 - (model2$deviance / model_null$deviance)
# 0.2520109
1 - (model3$deviance / model_null$deviance)
# 0.003453936

Questions:

  1. Why doesn't summary() report the null deviance?

  2. Have I calculated the proportion of variation explained correctly?

  3. Am I right in saying model1 and model3 explain little variation in carb, but model2 explains 25% of the variation in carb?

Best Answer

Even for logistic regression with a dichotomous DV, there is no exact equivalent of $R^2$ (proportion of variance explained), nor is there any consensus on which approximation is best. Here is Paul Allison's explanation. However, the version Allison likes best is that of Tjur:

But there’s another $R^2$, recently proposed by Tjur (2009), that I’m inclined to prefer over McFadden’s. It has a lot of intuitive appeal, its upper bound is 1.0, and it’s closely related to $R^2$ definitions for linear models. It’s also easy to calculate.

The definition is very simple. For each of the two categories of the dependent variable, calculate the mean of the predicted probabilities of an event. Then, take the difference between those two means. That’s it!
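To make Tjur's definition concrete, here is a minimal sketch for a binary outcome, using am (0/1) from mtcars as a hypothetical example rather than one of the ordinal models above:

# Tjur's R^2: mean predicted probability among the 1s minus the mean
# predicted probability among the 0s.
fit <- glm(am ~ mpg, family = binomial, data = mtcars)  # hypothetical binary model
p   <- fitted(fit)                                      # predicted probabilities
mean(p[mtcars$am == 1]) - mean(p[mtcars$am == 0])       # Tjur's R^2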

Unfortunately, with more than two categories there will be more than one such difference; perhaps some statistic based on these could be used. Allison, however, is not optimistic about this:

Another potential complaint is that the Tjur $R^2$ cannot be easily generalized to ordinal or nominal logistic regression. For McFadden and Cox-Snell, the generalization is straightforward.

Of those two, Allison now prefers McFadden:

Here are the details. Logistic regression is, of course, estimated by maximizing the likelihood function. Let $L_0$ be the value of the likelihood function for a model with no predictors, and let $L_M$ be the likelihood for the model being estimated. McFadden’s $R^2$ is defined as

$R^2_{\mathrm{McF}} = 1 - \ln(L_M) / \ln(L_0)$
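In R this can be computed from the log-likelihoods of the fitted polr objects. As a sketch (assuming the models from the question are still in the workspace), and noting that polr stores its deviance as $-2\ln L$, this should reproduce the deviance ratios calculated in the question:

# McFadden's R^2 = 1 - ln(L_M)/ln(L_0); because polr's $deviance is -2*logLik,
# this matches 1 - deviance(model)/deviance(null) from the question.
1 - as.numeric(logLik(model1)) / as.numeric(logLik(model_null))
1 - as.numeric(logLik(model2)) / as.numeric(logLik(model_null))
1 - as.numeric(logLik(model3)) / as.numeric(logLik(model_null))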