Solved – Should statsmodels’s GLM produce the same results as R’s glm

Tags: logistic, python, r

Should Python's

statsmodels.api.GLM(train_y, train_X, family=sm.families.Binomial()).fit().predict(test_X)

always produce the same results as R's

predict(glm(y ~ ., data=train_X, family=binomial), newdata=test)

where train_y is a pandas DataFrame containing the y column of the corresponding R data.frame train, and where test_X and train_X are DataFrames containing the remaining columns of the test and train data frames, respectively?

If not, are there parameters that I can supply to statsmodels's GLM to make it produce the same results as R's glm?
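
For concreteness, one relevant difference between the two interfaces: R's formula adds an intercept automatically, while statsmodels's array interface does not, so the design matrices only match if a constant column is added explicitly. A minimal sketch of that adjustment, assuming the train_y, train_X, and test_X DataFrames from the question contain only numeric predictors (categorical columns would also need matching dummy coding):

import statsmodels.api as sm

# R's glm(y ~ ., ...) includes an intercept; the statsmodels array API does not,
# so add a constant column to both design matrices explicitly.
train_X_c = sm.add_constant(train_X)
test_X_c = sm.add_constant(test_X, has_constant="add")

model = sm.GLM(train_y, train_X_c, family=sm.families.Binomial()).fit()

# GLMResults.predict returns the mean response (probabilities here), whereas
# R's predict.glm defaults to the link (log-odds) scale unless type="response".
preds = model.predict(test_X_c)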

Best Answer

Yes, they should give the same answers if you fit the same model. Compare

R code

cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat", 
                   header=TRUE)
attach(cuse)
mod <- glm(cbind(using, notUsing) ~ age + education + wantsMore, family = binomial)
summary(mod)

Python code

import pandas as pd
import statsmodels.api as sm

# whitespace-separated file; the regex separator makes pandas use the python engine
cuse = pd.read_table("http://data.princeton.edu/wws509/datasets/cuse.dat",
                     sep=" +")
# two-column (successes, failures) response; Treatment('<25') matches R's
# default reference level for the age factor
res = sm.formula.glm("using + notUsing ~ C(age, Treatment('<25')) + "
                     "education + wantsMore", family=sm.families.Binomial(),
                     data=cuse).fit()
res.summary()
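
If the two fits really are the same model, the individual numbers should line up, not just the summary tables. A quick check, assuming the fit above converged (the comparisons to R's objects are illustrative, using the mod fitted earlier):

# Coefficients and standard errors; these should match R's coef(mod) and the
# summary(mod) table up to the naming of the dummy-coded factor levels.
print(res.params)
print(res.bse)

# Fitted probabilities for the observed covariate patterns; compare with
# R's fitted(mod) (or predict(mod, type="response")).
print(res.fittedvalues)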

If there's a convergence issue here, I wouldn't trust either answer without knowing why the convergence problems arise. I'd be interested in looking at data that reproduce these convergence failures in R.
