Logistic Regression – Comparing Scikit Learn and Statsmodels

Tags: logistic, python, regression, scikit-learn, statsmodels

I am trying to understand why the output from logistic regression in these
two libraries gives different results.

I am using the dataset from the UCLA idre tutorial, predicting admit based
on gre, gpa and rank. rank is treated as a categorical variable, so it
is first converted to dummy variables with rank_1 dropped. An intercept
column is also added.

from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
y, X = dmatrices('admit ~ gre + gpa + C(rank)', df, return_type = 'dataframe')
X.head()
>  Intercept  C(rank)[T.2]  C(rank)[T.3]  C(rank)[T.4]  gre   gpa
0          1             0             1             0  380  3.61
1          1             0             1             0  660  3.67
2          1             0             0             0  800  4.00
3          1             0             0             1  640  3.19
4          1             0             0             1  520  2.93

# Output from scikit-learn
model = LogisticRegression(fit_intercept = False)
mdl = model.fit(X, y)
model.coef_
> array([[-1.35417783, -0.71628751, -1.26038726, -1.49762706,  0.00169198,
     0.13992661]]) 
# corresponding to predictors [Intercept, rank_2, rank_3, rank_4, gre, gpa]

# Output from statsmodels
logit = sm.Logit(y, X)
logit.fit().params
> Optimization terminated successfully.
     Current function value: 0.573147
     Iterations 6
Intercept      -3.989979
C(rank)[T.2]   -0.675443
C(rank)[T.3]   -1.340204
C(rank)[T.4]   -1.551464
gre             0.002264
gpa             0.804038
dtype: float64

The output from statsmodels is the same as shown on the idre website, but I
am not sure why scikit-learn produces a different set of coefficients. Does
it minimize a different loss function? Is there any documentation that
describes the implementation?

Best Answer

Your clue to figuring this out should be that the parameter estimates from the scikit-learn estimation are uniformly smaller in magnitude than their statsmodels counterparts. This might lead you to believe that scikit-learn applies some kind of parameter regularization. You can confirm this by reading the scikit-learn documentation: by default, LogisticRegression fits with an L2 penalty.
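To make the shrinkage visible, here is a minimal sketch (not part of the original answer) that refits your X and y with a few values of C. For the L2 penalty, scikit-learn minimizes 0.5 * w'w + C * sum_i log(1 + exp(-y_i * x_i'w)), so a smaller C means a stronger penalty and coefficients pulled closer to zero:

# Sketch: coefficients shrink toward zero as C decreases (stronger penalty).
# Reuses the X and y built with patsy above; y is flattened to a 1-D array.
from sklearn.linear_model import LogisticRegression

for C in (1.0, 0.1, 0.01):
    m = LogisticRegression(fit_intercept=False, C=C, max_iter=10000).fit(X, y.values.ravel())
    print(C, m.coef_.round(3))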

There is no way to switch off regularization in scikit-learn, but you can make it ineffective by setting the tuning parameter C to a large number. Here is how that works in your case:

# module imports
from patsy import dmatrices
import pandas as pd
from sklearn.linear_model import LogisticRegression
import statsmodels.discrete.discrete_model as sm

# read in the data & create matrices
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
y, X = dmatrices('admit ~ gre + gpa + C(rank)', df, return_type = 'dataframe')

# sklearn output
model = LogisticRegression(fit_intercept = False, C = 1e9)
mdl = model.fit(X, y)
model.coef_

# sm
logit = sm.Logit(y, X)
logit.fit().params
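To confirm the two now agree, a quick side-by-side comparison (a sketch, assuming the snippet above has been run so that model and logit exist):

# Line up the large-C scikit-learn coefficients against the statsmodels MLE.
comparison = pd.DataFrame({'sklearn': model.coef_.ravel(),
                           'statsmodels': logit.fit().params.values},
                          index=X.columns)
print(comparison)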

UPDATE: As correctly pointed out in the comments below, you can now switch off the regularization in scikit-learn by setting penalty='none' (or penalty=None in recent versions; see the docs).
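For completeness, a minimal sketch of that unpenalized fit (assuming scikit-learn 1.2 or later, where the option is spelled penalty=None); the coefficients should match the statsmodels MLE up to solver tolerance:

# Unpenalized logistic regression in scikit-learn; should reproduce the
# statsmodels coefficients up to optimizer tolerance.
model = LogisticRegression(fit_intercept=False, penalty=None, max_iter=10000)
model.fit(X, y.values.ravel())
print(model.coef_)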