This is going to be at best a partial answer but hope it helps a little.
Given that your response is ordinal, you have to ask yourself whether the distance between categories is the same across the whole scale. In other words, if you think the gap between 1 and 3 is not necessarily the same as the gap between 2 and 4, then a cumulative link model (e.g. with a logit or probit link) is the best option. I'd recommend reading the tutorial and extra information from Christensen (2013) on the ordinal package to help you along the way.
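To make this concrete, here is a minimal sketch of fitting a cumulative link model with `clm()`. It uses the `wine` data that ships with the ordinal package (rating is an ordered factor); substitute your own response and predictors.

```r
# Minimal sketch of a cumulative link model with the ordinal package,
# using the wine data shipped with the package (rating is an ordered factor)
library(ordinal)
data(wine)
fit <- clm(rating ~ temp + contact, data = wine, link = "logit")
summary(fit)  # threshold (cut-point) estimates plus coefficients on the latent scale
```

The thresholds play the role of the "gaps" between categories: the model estimates them rather than assuming they are equal.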
Why people prefer lmer() probably has less to do with good statistics or econometrics and more with habit and the institutionalisation of methods. I know from experience that proposing a CLM while most people use GLS or OLS can be ill-advised, not because the CLM is not the better model, but because you are basically telling your community of readers "so far you guys were wrong", which is not that easy to swallow.
Centering and standardizing are often done because people think they will alleviate specific concerns such as collinearity. There is much debate about whether this is true, but in my opinion (and the same goes for log-transformations) you are reducing variance and changing the data, which is only a good idea if you have a theoretical motivation for it. If not, work with the actual data and change your model.
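As a toy illustration with plain lm() (hypothetical simulated data, not your model): linearly rescaling a predictor with scale() changes the coefficient's units but not the model's fit, so standardizing buys you a different interpretation, not a different model.

```r
# Hypothetical toy data: standardizing a predictor leaves the fit unchanged
set.seed(1)
x <- rnorm(50, mean = 10, sd = 3)
y <- 2 * x + rnorm(50)
fit_raw <- lm(y ~ x)          # coefficient in the original units of x
fit_std <- lm(y ~ scale(x))   # coefficient per standard deviation of x
# identical log-likelihood: only the coefficient scale differs
all.equal(as.numeric(logLik(fit_raw)), as.numeric(logLik(fit_std)))  # TRUE
```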
As to the choice of logit versus probit: the difference is generally not that big. Once again, I would be guided by the data. You can start as follows:
# Finding the best-fitting link function
# (replace `formula` and `df` with your own model formula and data)
library(ordinal)
links <- c("logit", "probit", "cloglog", "loglog", "cauchit")
sapply(links, function(link) {
  clm(formula, data = df, link = link)$logLik
})
# The link with the highest log-likelihood fits best

# Finding the best threshold function
thresholds <- c("symmetric", "flexible", "equidistant")
sapply(thresholds, function(threshold) {
  # use the best-fitting link found above in place of "logit"
  clm(formula, data = df, link = "logit", threshold = threshold)$logLik
})
This will tell you which specification fits your data best: the one with the highest log-likelihood.
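For a concrete, runnable version of the link comparison (again on the wine data that ships with the ordinal package, not your own data):

```r
# Compare link functions by log-likelihood on the ordinal package's wine data
library(ordinal)
data(wine)
links <- c("logit", "probit", "cloglog", "loglog", "cauchit")
ll <- sapply(links, function(link) {
  clm(rating ~ temp + contact, data = wine, link = link)$logLik
})
ll
names(which.max(ll))  # the link with the highest log-likelihood
```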
In general, the logit is better if you have a lot of "extreme" values (i.e. worst and best in your case), because it is tied to the logistic distribution, which has fatter tails than the normal distribution underlying the probit.
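You can check the tail claim directly in base R: far from the center, the standard logistic density puts much more mass than the standard normal (a rough comparison, since the standard logistic also has a larger standard deviation, about 1.81).

```r
# Tail density at 4: logistic vs. normal
dlogis(4)             # ~0.0177
dnorm(4)              # ~0.000134
dlogis(4) / dnorm(4)  # the logistic tail is about two orders of magnitude heavier here
```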
Finally, here is some simple code that will give you a good idea of how well your model (and X variables) predict the response variable. It will plot the incidence of correct predictions and the number of wrong predictions (and how wrong they are).
pred. <- predict(model, type = "class")$fit  # `model` is your fitted clm object
plot(df$RESPONSE, pred., type = "p", pch = 15,
     cex = sqrt(table(df$RESPONSE, pred.)) / 5)
# You will see this plot is not that useful
results <- data.frame(
  observed  = as.numeric(as.character(df$RESPONSE)),
  predicted = as.numeric(as.character(pred.)))
results$deviation <- results$observed - results$predicted
sum(results$deviation)  # a large positive value means you tend to underestimate
                        # the actual response; a large negative value, overestimate
results$dum <- 1
tmp <- data.frame(with(results, tapply(dum, deviation, sum)))
tmp$z <- as.numeric(rownames(tmp))  # the deviation levels actually observed
plot(tmp$z, tmp[, 1], type = "h",
     xlab = "Deviation from correct prediction", ylab = "Number of predictions")
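A plain confusion matrix via base table() is often the quickest complement to the plots above. A hypothetical example with simulated categories (with your model, you would cross-tabulate df$RESPONSE against pred. instead):

```r
# Hypothetical example: cross-tabulate observed vs. predicted ordinal categories
set.seed(42)
observed  <- factor(sample(1:5, 100, replace = TRUE))
predicted <- factor(sample(1:5, 100, replace = TRUE), levels = levels(observed))
conf <- table(observed, predicted)
conf                         # rows = observed, columns = predicted
sum(diag(conf)) / sum(conf)  # overall accuracy (exact matches only)
```

Off-diagonal cells far from the diagonal are the badly wrong predictions, which is the same information the deviation plot shows.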
Let me know what you think!
Simon
Best Answer
R has a package called sure, which uses SUrrogate REsiduals for diagnostics associated with cumulative link ordinal regression models. The package can be used to detect model misspecification with respect to mean structures, link functions, heteroscedasticity, proportionality, and interaction effects. It doesn't look like Stata has anything similar implemented.
To learn more about the package functionality, you can refer to the research article Residuals and Diagnostics for Binary and Ordinal Regression Models: An Introduction to the sure Package by Greenwell et al. (The R Journal Vol. 10/1, July 2018), which you can find here: https://journal.r-project.org/archive/2018/RJ-2018-004/RJ-2018-004.pdf.
The surrogate approach to defining residuals for an ordinal outcome Y was introduced in the paper Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach by Dungang Liu and Heping Zhang (Journal of the American Statistical Association, 113(522), 845-854, 2018). The idea underlying this approach is to define a continuous variable S as a “surrogate” of Y and then obtain residuals based on S. The paper is available here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6133273/.
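A minimal sketch of the package in use, assuming sure is installed. The data here are hypothetical (simulated from a latent-variable model); the model is fitted with MASS::polr(), which sure supports, and resids() returns one surrogate residual per observation.

```r
# Hypothetical sketch: surrogate residuals for an ordinal model (requires sure)
library(sure)  # SUrrogate REsiduals
library(MASS)  # polr(), ships with R
set.seed(101)
x <- runif(200, 1, 5)
latent <- 2 * x + rlogis(200)             # latent continuous response
y <- cut(latent, breaks = c(-Inf, 3, 6, 9, Inf),
         labels = 1:4, ordered_result = TRUE)
fit <- polr(y ~ x, method = "logistic")
res <- resids(fit)   # one surrogate residual per observation
qqnorm(res)          # eyeball the residual distribution for misspecification
```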