Ordinal Regression – Why Do statsmodels (Python) and MASS (R) Differ in Intercept Values, and How Can This Be Resolved?

ordered-logit, python, r, regression-coefficients, statsmodels

From the same dataset and the same ordinal regression model, I get two different sets of intercept estimates: one from Python's statsmodels library (the OrderedModel class) and one from R's MASS package (the polr function), and I don't understand why.

I don't know whether the discrepancy comes from a software issue or from a statistical issue of interpretation, so, being in doubt, I'm asking here.

Anyway, here's the polr output:

Call:
polr(formula = satisfaction ~ gender, data = u, Hess = TRUE, 
    model = FALSE)

Coefficients:
             Value Std. Error t value
genderwomen 0.1183     0.1204  0.9833

Intercepts:
                                 Value    Std. Error t value 
not satisfied|somehow satisfied   -1.5239   0.1185   -12.8592
somehow satisfied|very satisfied   1.0782   0.1159     9.2994

Residual Deviance: 5076.661 
AIC: 5082.661

And here's the statsmodels output:

        Optimization terminated successfully.
         Current function value: 0.975156
         Iterations: 79
         Function evaluations: 141
                             OrderedModel Results                             
==============================================================================
Dep. Variable:           satisfaction   Log-Likelihood:                -2538.3
Model:                   OrderedModel   AIC:                             5083.
Method:            Maximum Likelihood   BIC:                             5100.
Date:                Tue, 18 Oct 2022                                         
Time:                        07:42:53                                         
No. Observations:                2603                                         
Df Residuals:                    2600                                         
Df Model:                           3                                         
==================================================================================================================
                                                     coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------
C(gender, Treatment(reference='men'))[T.women]     0.1183      0.120      0.983      0.325      -0.118       0.354
not satisfied/somehow satisfied                   -1.5240      0.119    -12.859      0.000      -1.756      -1.292
somehow satisfied/very satisfied                   0.9564      0.023     42.259      0.000       0.912       1.001
==================================================================================================================

As you can see, all values agree closely, except the "somehow satisfied/very satisfied" threshold (1.0782 in polr vs. 0.9564 in statsmodels), which is confusing.

Why are they different?

If it can help, here's some code to reproduce the results above:

In R:

library(MASS)
#creating the dataset
h_ps = c("men", "not satisfied", 61)
f_ps = c("women", "not satisfied", 366)
h_ms = c("men", "somehow satisfied", 158)
f_ms = c("women", "somehow satisfied", 1304)
h_ts = c("men", "very satisfied", 83)
f_ts = c("women", "very satisfied", 631)    
data = as.data.frame(rbind(h_ps, f_ps, h_ms, f_ms, h_ts, f_ts))
names(data) <- c("gender", "satisfaction", "Freq")
data$Freq <- as.numeric(data$Freq) #rbind() of character vectors makes Freq a character column
#function to create a raw dataset from the crosstable, taken from https://stackoverflow.com/a/30863433/3889011
tableinv <- function(x){
  y <- x[rep(rownames(x),x$Freq),1:(ncol(x)-1)]
  rownames(y) <- c(1:nrow(y))
  return(y)}
u = tableinv(data)
u$satisfaction <- factor(u$satisfaction, ordered = TRUE,
                         levels = c("not satisfied", "somehow satisfied", "very satisfied"))
u$gender <- relevel(factor(u$gender), ref="men")
#ordinal regression model
m <- polr(satisfaction ~ gender, data = u, Hess=TRUE)
summary(m)

In Python:

import pandas
from statsmodels.miscmodels.ordinal_model import OrderedModel
#creating the dataset
crosstab = pandas.DataFrame(columns=["not satisfied", "somehow satisfied", "very satisfied"],
                index=["men", "women"],
                data=[[61,158,83],[366,1304,631]])
s = crosstab.stack()  #long format: one count per (gender, satisfaction) pair
raw_data = s.index.repeat(s).to_frame(index=False)  #expand the counts into one row per respondent
raw_data.columns = ["gender", "satisfaction"]
raw_data["satisfaction"] = pandas.Categorical(raw_data["satisfaction"], ordered=True,
                                              categories=["not satisfied", "somehow satisfied", "very satisfied"])
#ordinal regression model
mod_prob = OrderedModel.from_formula("satisfaction ~ C(gender, Treatment(reference='men'))", raw_data, distr='logit')

res_prob = mod_prob.fit()

print(res_prob.summary())
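
As a side note, polr's "Residual Deviance" is simply -2 times the log-likelihood, so one quick way to confirm that both tools are fitting the same model (assuming the res_prob object from the code above) is:

#polr reports Residual Deviance = -2 * log-likelihood
print(-2 * res_prob.llf)  #expected: ~5076.66, matching R's output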

Additional information:

  • R version 4.1.3, MASS_7.3-58.1
  • Python 3.10.2, statsmodels 0.13.2

Thanks,

Best Answer

I found the answer: it is exactly the issue described in this question comparing Python and SPSS, Ordinal Regression: Python vs. SPSS, namely:

The Python coefficients are the logarithms of the differences

(while R handles the coefficients the same way SPSS does).

Basically, statsmodels reports the first cutpoint as-is and, for each subsequent cutpoint, the logarithm of its distance from the previous one. So to recover the same intercept as in R, we add the exponentiated second threshold parameter ("somehow satisfied/very satisfied", i.e. exp(0.9564)) to the first one ("not satisfied/somehow satisfied", i.e. -1.5240): -1.5240 + exp(0.9564) = 1.0783, which matches what polr returns. In general, the k-th cutpoint equals the first threshold parameter plus the sum of the exponentials of the next k-1 threshold parameters.

Let's check with the exact values in Python to be sure:

import numpy as np
r_value = 1.0782  #second intercept reported by polr (rounded to 4 decimals)
print(res_prob.params)  #let's check what the parameters look like
print()
#first cutpoint plus the exponentiated second threshold parameter
python_value = res_prob.params.iloc[1] + np.exp(res_prob.params.iloc[2])
python_value = round(python_value, 4)  #the value from R is rounded too
if r_value == python_value:
    print(r_value, "==", python_value)
    print("it works!")

output:

gender_                             0.118348
Not satisfied/somehow satisfied    -1.523956
somehow satisfied/very satisfied    0.956351
dtype: float64

1.0782 == 1.0782
it works!
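
For completeness: statsmodels also provides a helper that applies this cumulative-exponential transformation to all cutpoints at once, OrderedModel.transform_threshold_params (available at least in the 0.13.2 version used here). The first and last entries of its result are -inf/+inf boundary sentinels:

#recover the actual cutpoints from the internal parameterization
thresholds = mod_prob.transform_threshold_params(res_prob.params)
print(thresholds[1:-1])  #expected: approximately [-1.5240, 1.0782], as in polr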

Thanks to @josef for his comment that confirms what I suspected!