Solved – Logit in statsmodels gives negative coefficients

logistic, statsmodels

I have a dataset of users with a gender and a value "fr" (in the range [0, 1]), which I hope can be used to predict the gender.

I tried fitting this data, but contrary to what I expect and what makes sense, the model predicts that a higher fr means a lower chance of being female. I assume I must be misunderstanding something; this is my first time using logistic regression and the statsmodels package.

>>> print(df.head(3))
   gender        fr is_female
0  female  0.438898      True
1    male  0.285226     False
2    male  0.157895     False

>>> print(df.describe())
                 fr
count  64900.000000
mean       0.304351
std        0.160970
min        0.000000
25%        0.200000
50%        0.285714
75%        0.392857
max        1.000000

>>> g = sns.FacetGrid(df, col="gender")
>>> g.map(plt.hist, "fr", bins=25)    

(Figure: histogram of fr, one panel per gender)

>>> sns.lmplot(x="fr", y="is_female", data=df.sample(1000), logistic=True, y_jitter=.05)

(Figure: logistic plot by seaborn)

These two plots (I think) show that it should be possible to use Logit to predict the gender. However, when I fit the model with statsmodels, it returns a negative coefficient:

>>> import statsmodels.api as sm
>>> logit = sm.Logit(df["is_female"], df["fr"])
>>> result = logit.fit()
Optimization terminated successfully.
     Current function value: 0.682087
     Iterations 4
>>> print(result.summary())
                               Logit Regression Results                           
==============================================================================
Dep. Variable:              is_female   No. Observations:                64900
Model:                          Logit   Df Residuals:                    64899
Method:                           MLE   Df Model:                            0
Date:                Fri, 23 Dec 2016   Pseudo R-squ.:                -0.08770
Time:                        18:06:44   Log-Likelihood:                -44267.
converged:                       True   LL-Null:                       -40698.
                                        LLR p-value:                       nan
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
fr            -0.8741      0.023    -37.457      0.000        -0.920    -0.828
==============================================================================

Plotting the predictions also shows that the fit is not correct:

>>> df_ = pd.DataFrame({"fr":np.linspace(0,1,11)})
>>> df_["female_predict"] = result.predict(df_[["fr"]])
>>> df_.plot(x="fr", y="female_predict")

(Figure: result from statsmodels.Logit)

Best Answer

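Answered in comments: You need to add a constant to the regression. sm.Logit does not add an intercept automatically, so the model in the question is logit(p) = b * fr, which forces the predicted probability to 0.5 at fr = 0. The summary shows the symptoms: Df Model is 0, the pseudo R-squared is negative, and the log-likelihood (-44267) is worse than LL-Null (-40698), which comes from a constant-only model. Wrapping the predictor with sm.add_constant gives the fit an intercept.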