I am trying to understand why logistic regression from these two libraries gives different results.

I am using the dataset from the UCLA IDRE tutorial, predicting `admit` based on `gre`, `gpa` and `rank`. `rank` is treated as a categorical variable, so it is first converted to dummy variables with `rank_1` dropped. An intercept column is also added.
```py
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
y, X = dmatrices('admit ~ gre + gpa + C(rank)', df, return_type = 'dataframe')
X.head()
>    Intercept  C(rank)[T.2]  C(rank)[T.3]  C(rank)[T.4]  gre   gpa
0            1             0             1             0  380  3.61
1            1             0             1             0  660  3.67
2            1             0             0             0  800  4.00
3            1             0             0             1  640  3.19
4            1             0             0             1  520  2.93
# Output from scikit-learn
model = LogisticRegression(fit_intercept = False)
mdl = model.fit(X, y)
model.coef_
> array([[-1.35417783, -0.71628751, -1.26038726, -1.49762706, 0.00169198,
0.13992661]])
# corresponding to predictors [Intercept, rank_2, rank_3, rank_4, gre, gpa]
# Output from statsmodels
logit = sm.Logit(y, X)
logit.fit().params
> Optimization terminated successfully.
Current function value: 0.573147
Iterations 6
Intercept -3.989979
C(rank)[T.2] -0.675443
C(rank)[T.3] -1.340204
C(rank)[T.4] -1.551464
gre 0.002264
gpa 0.804038
dtype: float64
```
The output from `statsmodels` is the same as shown on the IDRE website, but I am not sure why scikit-learn produces a different set of coefficients. Does it minimize a different loss function? Is there any documentation that describes the implementation?
Best Answer
Your clue to figuring this out should be that the parameter estimates from the scikit-learn estimation are uniformly smaller in magnitude than their statsmodels counterparts. This might lead you to believe that scikit-learn applies some kind of parameter regularization, and indeed it does: by default, LogisticRegression fits with an L2 penalty and C=1.0, so it maximizes a penalized likelihood rather than the plain likelihood that statsmodels maximizes. You can confirm this by reading the scikit-learn documentation.
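You can also inspect the defaults directly on the estimator, as a quick check:

```py
from sklearn.linear_model import LogisticRegression

# The default configuration applies an L2 penalty with C=1.0,
# i.e. every fit is regularized unless these settings are changed.
params = LogisticRegression().get_params()
print(params['penalty'], params['C'])  # -> l2 1.0
```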
There is no way to switch off regularization in scikit-learn, but you can make it ineffective by setting the tuning parameter `C` to a large number; `C` is the inverse of the regularization strength, so a large value makes the penalty negligible. Here is how that works in your case:
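A minimal sketch, reusing the `X` and `y` built in the question (the exact digits may differ slightly across scikit-learn versions):

```py
# A very large C makes the L2 penalty negligible, so the fit approaches
# the unpenalized maximum-likelihood solution that statsmodels computes.
model = LogisticRegression(fit_intercept=False, C=1e9)
model.fit(X, y.values.ravel())  # ravel y to 1-D to avoid a shape warning
model.coef_
# -> approximately [-3.99, -0.675, -1.340, -1.551, 0.0023, 0.804],
#    in line with the statsmodels estimates above
```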
UPDATE: As correctly pointed out in the comments below, you can now switch off the regularization in scikit-learn by setting `penalty='none'` (see the docs).
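For instance (on recent releases the option is spelled `penalty=None`; older ones use the string `'none'`):

```py
# Unpenalized logistic regression; use penalty='none' on older releases.
model = LogisticRegression(fit_intercept=False, penalty=None)
model.fit(X, y.values.ravel())
model.coef_  # should closely match the statsmodels estimates
```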