Solved – Logistic Regression fails in statsmodels but works in sklearn; Breast Cancer dataset

python, scikit-learn, singular, statsmodels

I am learning about both the statsmodels library and sklearn. I am trying to fit a logistic regression model with each library, trained on the same dataset.

In sklearn, the following works:

# import the data
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

X_df = pd.DataFrame(data.data, columns=data.feature_names)
y_df = pd.DataFrame(data.target, columns=['target'])

# split into train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# fit the model and make a prediction
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
Xscaled  = scaler.fit_transform(X_train)

clf1     = LogisticRegression()
clf1.fit(Xscaled, y_train.values.ravel())

# note: fit_transform refits the scaler on the test data;
# scaler.transform(X_test) would reuse the training-set statistics
y_pred = clf1.predict(scaler.fit_transform(X_test))

accuracy_score(y_test.values.ravel(), y_pred) * 100

This works and gives a result of

98.24561403508771

Now I want to do something similar with the statsmodels library.

I do the following (continuing in the same notebook):

import statsmodels.api as sm

Xs = sm.add_constant(Xscaled)
res = sm.Logit(y_train, Xs).fit()

But this gives an error:

LinAlgError: Singular matrix

What is causing the discrepancy between sklearn and statsmodels?

Best Answer

I suspect the reason is that in scikit-learn the default logistic regression is not plain logistic regression, but a penalized logistic regression (by default with an L2 penalty, i.e. ridge-style regularization). As a result it can still produce estimates in cases of perfect separation (e.g. some predictor is all 1 or all 0 for one class) or when some combination of predictors predicts the outcome perfectly, whereas standard non-penalized logistic regression runs into problems there: the maximum-likelihood estimate is infinite. You can either regard such an infinite estimate as legitimate (e.g. 100% of stones thrown into the air fall to the ground), or as a problem of limited data (e.g. 100% of the 10 people who fell out of a plane died, but occasionally people do survive).
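
As a rough sketch of how you might get something to run on the statsmodels side (continuing from the variables above; exact keyword names depend on your library versions): you can either switch the optimizer away from the default Newton-Raphson, which raises the singular-matrix error when the Hessian cannot be inverted, or use statsmodels' penalized fit. Note that Logit.fit_regularized uses an L1 penalty, not scikit-learn's default L2, so the coefficients will not match sklearn exactly. On the scikit-learn side you can also drop the penalty entirely to see it behave more like plain maximum-likelihood logistic regression.

# scikit-learn without its default L2 penalty (may emit convergence warnings);
# recent versions accept penalty=None, older ones expect penalty='none'
from sklearn.linear_model import LogisticRegression

clf_unpenalized = LogisticRegression(penalty=None, max_iter=5000)
clf_unpenalized.fit(Xscaled, y_train.values.ravel())

# statsmodels: a quasi-Newton optimizer often avoids the singular-matrix error
# that the default Newton-Raphson solver raises on an ill-conditioned Hessian
import statsmodels.api as sm

Xs = sm.add_constant(Xscaled)
res_bfgs = sm.Logit(y_train.values.ravel(), Xs).fit(method='bfgs', maxiter=1000)

# statsmodels: an L1-penalized fit (not the same penalty as sklearn's default L2)
res_l1 = sm.Logit(y_train.values.ravel(), Xs).fit_regularized(method='l1', alpha=1.0)

print(res_bfgs.summary())

Even then, under (quasi-)perfect separation the unpenalized fits may fail to converge or report enormous standard errors, which is exactly the point made above.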