Solved – Should statsmodels’s GLM produce the same results as R’s glm

Tags: logistic, python, r

Should Python's

statsmodels.api.GLM(train_y, train_X, family=sm.families.Binomial()).fit().predict(test_X)

always produce the same results as R's

predict(glm(y ~ ., data=train_X, family=binomial), newdata=test)

where train_y is a pandas DataFrame containing the y column of the corresponding R data.frame train, and where test_X and train_X are DataFrames containing the remaining columns of the test and train data frames, respectively?

If not, are there parameters that I can supply to statsmodels's GLM to make it produce the same results as R's glm?
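
For concreteness, one relevant difference between the two interfaces: R's formula adds an intercept automatically, while statsmodels's array interface does not, so the design matrices only match if a constant column is added explicitly. A minimal sketch of that adjustment, assuming the train_y, train_X, and test_X DataFrames from the question contain only numeric predictors (categorical columns would also need matching dummy coding):

import statsmodels.api as sm

# R's glm(y ~ ., ...) includes an intercept; the statsmodels array API does not,
# so add a constant column to both design matrices explicitly.
train_X_c = sm.add_constant(train_X)
test_X_c = sm.add_constant(test_X, has_constant="add")

model = sm.GLM(train_y, train_X_c, family=sm.families.Binomial()).fit()

# GLMResults.predict returns the mean response (probabilities here), whereas
# R's predict.glm defaults to the link (log-odds) scale unless type="response".
preds = model.predict(test_X_c)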

Best Answer

Yes, they should give the same answers if you fit the same model. Compare

R code

cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat", 
                   header=TRUE)
attach(cuse)
mod <- glm(cbind(using, notUsing) ~ age + education + wantsMore, family = binomial)
summary(mod)

Python code

import pandas as pd
import statsmodels.api as sm

# whitespace-separated file; the regex separator makes pandas use the python engine
cuse = pd.read_table("http://data.princeton.edu/wws509/datasets/cuse.dat",
                     sep=" +")
# two-column (successes, failures) response; Treatment('<25') matches R's
# default reference level for the age factor
res = sm.formula.glm("using + notUsing ~ C(age, Treatment('<25')) + "
                     "education + wantsMore", family=sm.families.Binomial(),
                     data=cuse).fit()
res.summary()
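
If the two fits really are the same model, the individual numbers should line up, not just the summary tables. A quick check, assuming the fit above converged (the comparisons to R's objects are illustrative, using the mod fitted earlier):

# Coefficients and standard errors; these should match R's coef(mod) and the
# summary(mod) table up to the naming of the dummy-coded factor levels.
print(res.params)
print(res.bse)

# Fitted probabilities for the observed covariate patterns; compare with
# R's fitted(mod) (or predict(mod, type="response")).
print(res.fittedvalues)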

If there's a convergence issue here, I wouldn't trust either answer without knowing why the convergence problems arise. I'd be interested in looking at data that reproduce these convergence failures in R.
