This is going to be at best a partial answer but hope it helps a little.
Given that your response is ordinal, you have to ask yourself whether the distance between categories is the same across the whole scale. In other words, if you think the gap between 1 and 3 is not necessarily the same as the gap between 2 and 4, then a cumulative link model (e.g. with a logit or probit link) is the best option. I'd recommend reading the tutorial and extra information from Christensen (2013) on the ordinal package to help you along the way.
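To make this concrete, here is a minimal sketch of fitting a cumulative link model with `clm()`. It uses the `wine` data that ships with the ordinal package (rating is an ordered factor); substitute your own response and predictors.

```r
# Minimal sketch of a cumulative link model with the ordinal package,
# using the wine data shipped with the package (rating is an ordered factor)
library(ordinal)
data(wine)
fit <- clm(rating ~ temp + contact, data = wine, link = "logit")
summary(fit)  # threshold (cut-point) estimates plus coefficients on the latent scale
```

The thresholds play the role of the "gaps" between categories: the model estimates them rather than assuming they are equal.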
Why people prefer lmer() probably has less to do with good statistics or econometrics and more with habit and the institutionalisation of methods. I know from experience that proposing a CLM while most people use GLS or OLS can be ill-advised, not because the CLM is not the better model, but because you are basically telling your community of readers "so far you guys were wrong", which is not that easy to swallow.
Centering and standardizing are often done because people think they will alleviate specific concerns such as collinearity. There is much debate about whether this is true, but in my opinion (and the same goes for log-transformations) you are reducing variance and changing the data, which is only a good idea if you have a theoretical motivation for it. If not, work with the actual data and change your model.
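As a toy illustration with plain lm() (hypothetical simulated data, not your model): linearly rescaling a predictor with scale() changes the coefficient's units but not the model's fit, so standardizing buys you a different interpretation, not a different model.

```r
# Hypothetical toy data: standardizing a predictor leaves the fit unchanged
set.seed(1)
x <- rnorm(50, mean = 10, sd = 3)
y <- 2 * x + rnorm(50)
fit_raw <- lm(y ~ x)          # coefficient in the original units of x
fit_std <- lm(y ~ scale(x))   # coefficient per standard deviation of x
# identical log-likelihood: only the coefficient scale differs
all.equal(as.numeric(logLik(fit_raw)), as.numeric(logLik(fit_std)))  # TRUE
```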
As to the choice of logit versus probit: the difference is generally not that big. Once again, I would be guided by the data. You can start as follows:
# Finding the best-fitting link function
# (replace `formula` and `df` with your own model formula and data)
library(ordinal)
links <- c("logit", "probit", "cloglog", "loglog", "cauchit")
sapply(links, function(link) {
  clm(formula, data = df, link = link)$logLik
})
# The link with the highest log-likelihood fits best

# Finding the best threshold function
thresholds <- c("symmetric", "flexible", "equidistant")
sapply(thresholds, function(threshold) {
  # use the best-fitting link found above in place of "logit"
  clm(formula, data = df, link = "logit", threshold = threshold)$logLik
})
This will tell you which specification fits your data best: the one with the highest log-likelihood.
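For a concrete, runnable version of the link comparison (again on the wine data that ships with the ordinal package, not your own data):

```r
# Compare link functions by log-likelihood on the ordinal package's wine data
library(ordinal)
data(wine)
links <- c("logit", "probit", "cloglog", "loglog", "cauchit")
ll <- sapply(links, function(link) {
  clm(rating ~ temp + contact, data = wine, link = link)$logLik
})
ll
names(which.max(ll))  # the link with the highest log-likelihood
```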
In general, the logit is better if you have a lot of "extreme" values (i.e. worst and best in your case), because it is tied to the logistic distribution, which has fatter tails than the normal distribution underlying the probit.
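You can check the tail claim directly in base R: far from the center, the standard logistic density puts much more mass than the standard normal (a rough comparison, since the standard logistic also has a larger standard deviation, about 1.81).

```r
# Tail density at 4: logistic vs. normal
dlogis(4)             # ~0.0177
dnorm(4)              # ~0.000134
dlogis(4) / dnorm(4)  # the logistic tail is about two orders of magnitude heavier here
```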
Finally, here is some simple code that will give you a good idea of how well your model (and X variables) predict the response variable. It will plot the incidence of correct predictions and the number of wrong predictions (and how wrong they are).
pred. <- predict(model, type = "class")$fit  # `model` is your fitted clm object
plot(df$RESPONSE, pred., type = "p", pch = 15,
     cex = sqrt(table(df$RESPONSE, pred.)) / 5)
# You will see this plot is not that useful
results <- data.frame(
  observed  = as.numeric(as.character(df$RESPONSE)),
  predicted = as.numeric(as.character(pred.)))
results$deviation <- results$observed - results$predicted
sum(results$deviation)  # a large positive value means you tend to underestimate
                        # the actual response; a large negative value, overestimate
results$dum <- 1
tmp <- data.frame(with(results, tapply(dum, deviation, sum)))
tmp$z <- as.numeric(rownames(tmp))  # the deviation levels actually observed
plot(tmp$z, tmp[, 1], type = "h",
     xlab = "Deviation from correct prediction", ylab = "Number of predictions")
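A plain confusion matrix via base table() is often the quickest complement to the plots above. A hypothetical example with simulated categories (with your model, you would cross-tabulate df$RESPONSE against pred. instead):

```r
# Hypothetical example: cross-tabulate observed vs. predicted ordinal categories
set.seed(42)
observed  <- factor(sample(1:5, 100, replace = TRUE))
predicted <- factor(sample(1:5, 100, replace = TRUE), levels = levels(observed))
conf <- table(observed, predicted)
conf                         # rows = observed, columns = predicted
sum(diag(conf)) / sum(conf)  # overall accuracy (exact matches only)
```

Off-diagonal cells far from the diagonal are the badly wrong predictions, which is the same information the deviation plot shows.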
Let me know what you think!
Simon
Best Answer
R has a package called sure, which uses SUrrogate REsiduals for diagnostics associated with cumulative link ordinal regression models. The package can be used to detect model misspecification with respect to mean structures, link functions, heteroscedasticity, proportionality, and interaction effects. It doesn't look like Stata has anything similar implemented.
To learn more about the package functionality, you can refer to the research article Residuals and Diagnostics for Binary and Ordinal Regression Models: An Introduction to the sure Package by Greenwell et al. (The R Journal Vol. 10/1, July 2018), which you can find here: https://journal.r-project.org/archive/2018/RJ-2018-004/RJ-2018-004.pdf.
The surrogate approach to defining residuals for an ordinal outcome Y was introduced in the paper Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach by Dungang Liu and Heping Zhang (Journal of the American Statistical Association, 113(522), 845-854, 2018). The idea underlying this approach is to define a continuous variable S as a “surrogate” of Y and then obtain residuals based on S. The paper is available here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6133273/.
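A minimal sketch of the package in use, assuming sure is installed. The data here are hypothetical (simulated from a latent-variable model); the model is fitted with MASS::polr(), which sure supports, and resids() returns one surrogate residual per observation.

```r
# Hypothetical sketch: surrogate residuals for an ordinal model (requires sure)
library(sure)  # SUrrogate REsiduals
library(MASS)  # polr(), ships with R
set.seed(101)
x <- runif(200, 1, 5)
latent <- 2 * x + rlogis(200)             # latent continuous response
y <- cut(latent, breaks = c(-Inf, 3, 6, 9, Inf),
         labels = 1:4, ordered_result = TRUE)
fit <- polr(y ~ x, method = "logistic")
res <- resids(fit)   # one surrogate residual per observation
qqnorm(res)          # eyeball the residual distribution for misspecification
```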