Prediction – How to Use Ordered Probit Regression and Calculate Prediction Accuracy

accuracy, ordered-probit, prediction, rms

I want to do an ordered probit regression, then cross-validate model prediction accuracy with 80% data for training and 20% for validation, and calculate RMSE for predictions.

Consider this dataset:

I did this:

x=c(2.3,3.1,3.5,10.0,6.8,5.0,5.4,3.2)
y=c(1,2,2,5,4,3,2,1)
myData=data.frame(cbind(x,y))

library("MASS")
reg=polr(as.factor(myData$y)~myData$x,data=myData,method="probit")

I saw this question, but I couldn't fully understand it. Suppose myValidationData contains the 20% of the data that I want to use for validation. So, I would do:

fit=predict(reg,type="probs")
x=c(5.6, 5.1)
y=c(3,3)
myValidationData=data.frame(cbind(x,y))

This is how I tried to predict, but is it correct if I want to cross-validate?

fit=predict(reg,data=myValidationData,type="probs")

How should I measure RMSE? And, how can I plot the prediction?

Best Answer

The R rms package has many capabilities for validating ordinal regression models. Start with the orm function. Note that split-sample validation requires an extremely large sample size to work reliably. You might be better off with bootstrap validation as implemented in the rms validate and calibrate functions.
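As a rough illustration of that suggestion, here is a minimal sketch of fitting an ordered probit with orm and bootstrapping its validation indexes. It assumes the rms package is installed; the simulated data, the variable names, and the choice of B = 200 resamples are illustrative assumptions, not part of the original question.

```r
library(rms)  # assumes the rms package is installed

# Simulated data (an assumption for illustration): 4 ordered categories
set.seed(1)
n <- 300
x <- runif(n, 0, 10)
latent <- 0.6 * x + rnorm(n)
y <- cut(latent, breaks = c(-Inf, 1.5, 3, 4.5, Inf), labels = FALSE)
d <- data.frame(x, y)

# Ordered probit; x = TRUE and y = TRUE store the design matrix and
# response in the fit, which validate() needs for resampling
fit <- orm(y ~ x, data = d, family = "probit", x = TRUE, y = TRUE)

# Bootstrap validation: reports apparent, optimism, and
# optimism-corrected values for each accuracy index
val <- validate(fit, B = 200)
print(val)
```

The column index.corrected holds the optimism-corrected estimates; the exact set of indexes reported can differ between rms versions.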

Measures of predictive accuracy for ordinal $Y$ include

Generalized $c$-index (generalized ROC area) from Somers' $D_{xy}$ rank correlation
Spearman $\rho$
Other rank correlation measures - these are all measures of pure predictive discrimination
Generalized $R^2$ based on model likelihood ratio $\chi^2$ statistic
Calibration accuracy for $Prob(Y \geq y | X)$ using a nonparametric smooth calibration curve
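For the 80/20 split described in the question, a sketch with MASS::polr could look like the following. Two points matter: the formula should use bare column names (y ~ x rather than myData$y ~ myData$x) so that predict can look them up in new data, and the argument of predict is newdata, not data (an unrecognized data argument is silently ignored and the training-set predictions are returned). The simulated data, the sample size, and the RMSE computed on integer category codes are assumptions made for illustration; that RMSE treats the categories as equally spaced, which an ordinal model does not assume.

```r
library(MASS)  # provides polr()

# Simulated data (an assumption for illustration): 4 ordered categories
set.seed(42)
n <- 500
x <- runif(n, 0, 10)
latent <- 0.6 * x + rnorm(n)
y <- cut(latent, breaks = c(-Inf, 1.5, 3, 4.5, Inf), labels = FALSE)
myData <- data.frame(x = x, y = factor(y, ordered = TRUE))

# 80% training, 20% validation
train_idx <- sample(n, size = 0.8 * n)
train <- myData[train_idx, ]
valid <- myData[-train_idx, ]

# Bare column names in the formula so predict() can use newdata
reg <- polr(y ~ x, data = train, method = "probit")

# Predictions for the held-out rows: note 'newdata', not 'data'
probs <- predict(reg, newdata = valid, type = "probs")
pred_class <- predict(reg, newdata = valid, type = "class")

# Accuracy summaries on the validation set
obs <- as.integer(valid$y)                                   # integer category codes
expected_score <- as.vector(probs %*% seq_len(ncol(probs)))  # probability-weighted mean category
rmse <- sqrt(mean((obs - expected_score)^2))                 # assumes equal spacing of categories
accuracy <- mean(as.integer(pred_class) == obs)              # share of exact class matches
rho <- cor(expected_score, obs, method = "spearman")         # rank-based discrimination

c(rmse = rmse, accuracy = accuracy, spearman = rho)
```

Spearman's rho here corresponds to one of the rank-based discrimination measures listed above and does not depend on the spacing assumption.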

Related Solutions

Solved – Ordered Probit and categorical variables

Results from an ordered logit/probit regression are always unintuitive, but categorical explanatory variables are as meaningful as continuous ones. I'd even say that they are easier to interpret.

For a concrete example, you could look at Dobson, An Introduction to Generalized Linear Models, 2002, 2nd ed., Chapter 8. In her "car preferences" example, the dependent variable is the importance of air conditioning and power steering (three levels: "no or little importance", "important", "very important") and the two explanatory variables are gender (male or female, coded as 1 and 0) and age (18-23, 24-40, >40, coded as age2440 = 1 or 0, and agegt40 = 1 or 0).

Fitting an ordered probit model you get (I've used R, MASS library, polr() function):

Coefficients:
   male age2440 agegt40 
-0.3467  0.6817  1.3288 

Intercepts:
  NoImp|Imp Imp|VeryImp 
    0.01844     0.97594

Then you can compute the probabilities for women (male = 0) over 40 (age2440 = 0, agegt40 = 1):

NoImp     Imp VeryImp 
0.095   0.267   0.638

and for men over 40 (male = 1):

NoImp     Imp VeryImp 
0.168   0.330   0.502

Their difference (women minus men) is the gender partial effect:

 NoImp     Imp VeryImp 
-0.073  -0.063   0.136

I think that it's meaningful ;-)
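The probabilities above can be reproduced from the printed coefficients and intercepts, assuming the polr parameterization $Prob(Y \leq j | X) = \Phi(\zeta_j - X\beta)$. The helper name category_probs is hypothetical and only serves this sketch.

```r
# Coefficients and cutpoints copied from the polr output above
coefs <- c(male = -0.3467, age2440 = 0.6817, agegt40 = 1.3288)
cuts  <- c("NoImp|Imp" = 0.01844, "Imp|VeryImp" = 0.97594)

# Hypothetical helper: category probabilities for one covariate pattern
category_probs <- function(male, age2440, agegt40) {
  eta <- sum(coefs * c(male, age2440, agegt40))  # linear predictor
  cum <- c(pnorm(cuts - eta), 1)                 # cumulative Prob(Y <= j)
  p <- diff(c(0, cum))                           # per-category probabilities
  names(p) <- c("NoImp", "Imp", "VeryImp")
  p
}

women_over40 <- category_probs(male = 0, age2440 = 0, agegt40 = 1)
men_over40   <- category_probs(male = 1, age2440 = 0, agegt40 = 1)

round(women_over40, 3)
round(men_over40, 3)
round(women_over40 - men_over40, 3)  # gender partial effect, women minus men
```

Because the printed coefficients are rounded, the last digit of the results may differ slightly from the tables above.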

Solved – Ordered Probit Regression Results Interpretation

Generally you are estimating probabilities for every category j of your dependent variable y. As for something similar to marginal effects: not as far as I know. You can estimate the probabilities for the response categories with mfx in Stata, if I remember correctly.

Concerning the interpretation of the coefficients, UCLA can help: "Standard interpretation of the ordered logit coefficient is that for a one unit increase in the predictor, the response variable level is expected to change by its respective regression coefficient in the ordered log-odds scale while the other variables in the model are held constant."
