From the same dataset and the same ordinal regression model, I get two different intercept outputs with Python's statsmodels library (the OrderedModel class) and with R's MASS package (the polr function), and I don't understand why.
I don't know whether the problem originates from a software issue or from a statistical issue/interpretation, so in doubt I'm asking it here.
Anyway, here's the polr output:
Call:
polr(formula = satisfaction ~ gender, data = u, Hess = TRUE,
model = FALSE)
Coefficients:
Value Std. Error t value
genderwomen 0.1183 0.1204 0.9833
Intercepts:
Value Std. Error t value
not satisfied|somehow satisfied -1.5239 0.1185 -12.8592
somehow satisfied|very satisfied 1.0782 0.1159 9.2994
Residual Deviance: 5076.661
AIC: 5082.661
And here's the statsmodels output:
Optimization terminated successfully.
Current function value: 0.975156
Iterations: 79
Function evaluations: 141
OrderedModel Results
==============================================================================
Dep. Variable: satisfaction Log-Likelihood: -2538.3
Model: OrderedModel AIC: 5083.
Method: Maximum Likelihood BIC: 5100.
Date: Tue, 18 Oct 2022
Time: 07:42:53
No. Observations: 2603
Df Residuals: 2600
Df Model: 3
==================================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------------------------
C(gender, Treatment(reference='men'))[T.women] 0.1183 0.120 0.983 0.325 -0.118 0.354
not satisfied/somehow satisfied -1.5240 0.119 -12.859 0.000 -1.756 -1.292
somehow satisfied/very satisfied 0.9564 0.023 42.259 0.000 0.912 1.001
==================================================================================================================
As you can see, all the values agree, except the "somehow satisfied/very satisfied" intercept (1.0782 in polr vs. 0.9564 in statsmodels), which is a bit confusing.
Why are they different?
If it can help, here's some code to reproduce the results above:
In R:
library(MASS)
#creating the dataset
h_ps = c("men", "not satisfied", 61)
f_ps = c("women", "not satisfied", 366)
h_ms = c("men", "somehow satisfied", 158)
f_ms = c("women", "somehow satisfied", 1304)
h_ts = c("men", "very satisfied", 83)
f_ts = c("women", "very satisfied", 631)
data = as.data.frame(rbind(h_ps, f_ps, h_ms, f_ms, h_ts, f_ts))
names(data) <- c("gender", "satisfaction", "Freq")
#function to create a raw dataset from the crosstable, taken from https://stackoverflow.com/a/30863433/3889011
tableinv <- function(x){
  y <- x[rep(rownames(x), x$Freq), 1:(ncol(x)-1)]
  rownames(y) <- c(1:nrow(y))
  return(y)
}
u = tableinv(data)
u$satisfaction <- factor(u$satisfaction, ordered = TRUE,
                         levels = c("not satisfied", "somehow satisfied", "very satisfied"))
u$gender <- relevel(factor(u$gender), ref="men")
#ordinal regression model
m <- polr(satisfaction ~ gender, data = u, Hess=TRUE)
summary(m)
In Python:
import pandas
from statsmodels.miscmodels.ordinal_model import OrderedModel
#creating the dataset
crosstab = pandas.DataFrame(columns=["not satisfied", "somehow satisfied", "very satisfied"],
index=["men", "women"],
data=[[61,158,83],[366,1304,631]])
s = crosstab.stack([0])
raw_data = s.index.repeat(s).to_frame(index=False)
raw_data.columns = ["gender", "satisfaction"]
raw_data["satisfaction"] = pandas.Categorical(raw_data["satisfaction"], ordered=True,
categories=["not satisfied", "somehow satisfied", "very satisfied"])
#ordinal regression model
mod_prob = OrderedModel.from_formula("satisfaction ~ C(gender, Treatment(reference='men'))", raw_data, distr='logit')
res_prob = mod_prob.fit()
print(res_prob.summary())
Additional information:
- R version 4.1.3, MASS_7.3-58.1
- Python 3.10.2, statsmodels 0.13.2
Thanks,
Best Answer
I found the answer: this is exactly the same issue described here (Python vs. SPSS): Ordinal Regression: Python vs. SPSS. That is, statsmodels' OrderedModel reports the first threshold directly but each subsequent threshold as the log of its increment over the previous one (while R handles the intercepts the same way SPSS does).
Basically, to recover the same intercept as in R, we have to add the exponentiated coefficient ("somehow satisfied/very satisfied", i.e. exp(0.9564)) to the "base" intercept ("not satisfied/somehow satisfied", i.e. -1.5240), that is -1.5240 + exp(0.9564) == 1.078311, which is very close to what polr returns.
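To double-check, here is a minimal sketch in Python using the rounded values from the two summaries above (the exact fitted parameters would match even more closely):

```python
import math

# statsmodels' OrderedModel reports the first threshold directly and each
# later threshold as the log of its increment over the previous one.
first_threshold = -1.5240   # "not satisfied|somehow satisfied" (same in both outputs)
log_increment = 0.9564      # "somehow satisfied/very satisfied" as reported by statsmodels

# Undo the log-increment parameterization to recover polr's second intercept.
second_threshold = first_threshold + math.exp(log_increment)
print(round(second_threshold, 4))  # ~1.0783, matching polr's 1.0782
```

If I'm not mistaken, statsmodels also exposes OrderedModel.transform_threshold_params to convert the fitted parameters back to the cumulative thresholds directly.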
Thanks to @josef for his comment that confirms what I suspected!