Solved – Logistic Regression Failed in statsmodel but works in sklearn; Breast Cancer dataset

pythonscikit learnsingularstatsmodels

I am learning about both the statsmodel library and sklearn. I am trying to construct a logistic model for both libraries trained on the same dataset.

In sklearn, the following works:

# import the data
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

X_df = pd.DataFrame(data.data, columns=data.feature_names)
y_df = pd.DataFrame(data.target, columns=['target'])

# split into train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# fit the model and make a prediction
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
Xscaled  = scaler.fit_transform(X_train)

clf1     = LogisticRegression()
clf1.fit(Xscaled, y_train.values.ravel())

y_pred = clf1.predict(scaler.fit_transform(X_test))

accuracy_score(y_test.values.ravel(), y_pred)*100)

This works and gives a result of

98.24561403508771

Now I want to do something similar in the statsmodel library

I do the following (continuing in the same notebook):

import statsmodels.api as sm

Xs = sm.add_constant(Xscaled)
res = sm.Logit(y_train, Xs).fit()

But this gives an error:

LinAlgError: Singular matrix

What is causing the discrpancy between sklearn and statsmodel?

Best Answer

I suspect the reason is that in scikit-learn the default logistic regression is not exactly logistic regression, but rather a penalized logistic regression (by default ridge-regresion i.e. with a L2-penalty). This has the result that it can provide estimates etc. even in case of perfect separation (e.g. some predictors have all 1 or all 0) or situations where some combination of predictors results in "perfect" prediction, while standard non-penalized logistic regression runs into problems (either you think this is a legitimate infite estimate, e.g. 100% of stones thrown into the air fall to the ground, or think of this as a problem e.g. 100% of 10 people that fell out of plane died, but occasionally people will survive).

Option 1: Cholesky

The Gram matrix $K$, where $K_{ij} = k(x_i, x_j)$, is all the data we need to run a method. But $K$ is necessarily positive semi-definite, and for the ANOVA kernel, I believe that it is always nonsingular as well. Thus there is some matrix $R$ such that $K = R R^T$, i.e. $K$ is the Gram matrix of $R$. So you can compute $K$ yourself, and then use $R$ as the features.

This approach is exact, but the logistic regression model will be slow, since the dimensionality is equal to the number of inputs, and it can't be extended to new points easily (so you need to know what the test points are at training time).

import numpy as np
from sklearn.metrics.pairwise import check_pairwise_arrays

def anova_kernel(X, Y=None, gamma=None, p=1):
    X, Y = check_pairwise_arrays(X, Y)
    if gamma is None:
        gamma = 1. / X.shape[1]

    diff = X[:, None, :] - Y[None, :, :]
    diff **= 2
    diff *= -gamma
    np.exp(diff, out=diff)
    K = diff.sum(axis=2)
    K **= p
    return K


from scipy.linalg import cholesky
from sklearn.linear_model import LogisticRegression

# make X the matrix of all data points, train + test
# train is indices of train points, test of test
K = anova_kernel(X)
R = cholesky(K, lower=False)
clf = LogisticRegression()
clf.fit(R[train], y_train)
preds = clf.predict(R[test])

I don't recommend this; it's just here for completeness.

Note in the anova_kernel function above that it doesn't do the thing rapidminer does in "double-counting" points where they're both zero. I don't know why it does that; none of the other definitions of the anova kernel I've seen do.

Option 2: Nyström

This is an approximate kernel embedding which works by assuming that the kernel matrix $K$ is low-rank – which it almost always is. In Nyström, you pick a set of $D$ data points, then compute the full kernel evaluation of all the $n$ points to only those $D$ points. If you pick the first $D$ points, then the kernel matrix looks like:

$$ \begin{bmatrix} A & B \\ B^T & ? \end{bmatrix} $$ where you've computed $A$ (the pairwise kernel evaluations among the $D$ points) and $B$ (the kernel evaluations from each of the other points to the $D$ points), and $?$ will be guessed based on the assumption that $K$ is rank $D$. It turns out that the way to do that is to eigendecompose $A = U \Lambda U^T$, and then use the features $Z = \begin{bmatrix}U \Lambda^{\tfrac12} \\ B^T U \Lambda^\dagger \Lambda^{\tfrac12}\end{bmatrix}$.

Compared to Cholesky, this approach is inexact, but gives a low-dimensional embedding, and easily allows extension to new data points. Plus, it's already implemented in sklearn.kernel_approximation.Nystroem.

from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

K_train = anova_kernel(X_train)
clf = Pipeline([
    ('nys', Nystroem(kernel='precomputed', n_components=100)),
    ('lr', LogisticRegression())
])
clf.fit(K_train, y_train)

K_test = anova_kernel(X_test, X_train)
preds = clf.predict(K_test)

Option 3: Random Fourier Features

The other major paradigm of approximate kernel embeddings is random Fourier features (also sometimes called "random kitchen sinks"). These work on shift-invariant kernels $k(x, y) = k(x - y)$ by taking the Fourier transform of $k$, which by Bochner's theorem will be (scalable to) a probability distribution, sampling $\{\omega_i\}_{i=1}^{D/2}$ from it, and then using the feature map $$z(x) = \begin{bmatrix}\sin(\omega_1^T x) & \cos(\omega_1^T x) & \dots & \sin(\omega_{D/2}^T x) & \cos(\omega_{D/2}^T x)\end{bmatrix}.$$ (Sometimes you'll see this as $\cos(\omega_i^T X + b_i)$ instead, but this way is better.)

scikit-learn has these implemented for RBF and some variants of $\chi^2$ kernels, but not for ANOVA. But the kernel is shift-invariant:

$$ k(\Delta) = \left( \sum_{i=1}^d \exp(-\gamma \Delta_i^2) \right)^p $$

The problem is, I don't know how to Fourier transform it. :) For $d = 1$, it's a Gaussian with mean 0 and variance $2 p \gamma$. For $p = 1$, $d > 1$, you can do it by getting an embedding for each dimension and concatenating them together.

For integer $p$, you might be able to turn the product of sums into a sum of products and then Fourier transform each term, I suppose. Don't think that'd generalize to non-integral $p$, and I haven't tried it for integral $p > 1$ yet either.

Solved – How to make sense of this PCA plot with logistic regression decision boundary (breast cancer data)

What does the 2-d PCA data/plot mean?

The 2-d PCA data/plot represent two "compound features" which PCA created to capture as much of the variance in your original 13 features as possible.

Assuming your 13 features are linearly independent (e.g. one feature is not just another feature times 2, for every row in your data), it would take 13 dimensions to capture 100% of the variation in your 13 raw features. However, often PCA can capture e.g. 98% of the variation in your data in just a few PCA dimensions. To see how much of the variance is explained by each PCA dimension for your problem, print x_pca.explained_variance_ratio_ after you fit() your x_pca object.

When PCA can capture a large amount of the variance of your features in just 2 dimensions, that's especially convenient because then you can plot those 2 PCA dimensions as you have, and know that any groupings which show up on the 2-d plot correspond to natural groupings in your 13-dimensional data.

What does the decision boundary mean?

The decision boundary in your code is a prediction of your target variable, using as features (independent variables) the first two PCA dimensions of your 13-dimension original feature set.

Why is your decision boundary not in the obvious gap?

Remember the PCA dimensions were formed just based on your 13 independent variables, without looking at your target. The decision boundary is not a decision boundary between PCA clusters, it's a decision boundary using the PCA dimensions to predict target.

So the fact that the decision boundary is not totally between the clusters means PCA's first two dimensions of your 13 features do a good job of separating your target classes, but not a perfect job.

How to improve your plot

What you really care about is target, right? So in your plot, don't plot all the points as red. Color them by target class. Then you will have a plot that shows you how well PCA and the information in your original features (represented by distance/space between clusters on your plot) distinguishes between target classes (which would be colors on the new plot).

Best Answer

Related Solutions

Solved – Logistic regression using ANOVA kernel in SKLearn

Option 1: Cholesky

Option 2: Nyström

Option 3: Random Fourier Features

Solved – How to make sense of this PCA plot with logistic regression decision boundary (breast cancer data)

Related Question