Logistic Regression – sklearn Logistic Regression Converging to Unexpected Coefficient

logisticregressionscikit learnsigmoid-curve

The case is as follows:

Suppose that

import numpy as np

X = np.array([1, 1, 1])
y = np.array([1, 0, 1])

Then I perform a logistic regression with no intercept to check out the fitted coefficient:

from sklearn.linear_model import LogisticRegression

def fit_predict(X, y, fit_intercept=True):
  model = LogisticRegression(fit_intercept=fit_intercept)
  model.fit(X.reshape(-1, 1), y)
  print(f'model coefficients: {model.coef_}')

fit_predict(X, y, fit_intercept=False)

# output: [[0.2865409]]

I am pretty confused by this output. According to my algebra (directly solving the optimization constraint), the coefficient should be $logit(2/3) \approx 0.6931471805599452$.

Is this because my math is wrong, or because something else is going on that I don't know about?

The algebra is as follows, starting with the following equation:

$$ \sum_i y_i \cdot x_i – sigmoid(x_i) \cdot x_i = 0$$

If we plug the values in, then $$2 = 3\cdot sigmoid(1)$$.

I conclude that $\beta = logit(2/3)$.

Thanks in advance.

Best Answer

I will add my own answer to this question in order to shine some light on why a penalty is added by default. I'm also posting for posterity as you are not the first person to get caught by this and you won't be the last.

Back in 2019, Zachary Lipton discovered sklearn applies the penalty by default too and this sparked a very intense debate on twitter and elsewhere. The long and the short of that discussion is that sklearn sees itself as a machine learning library first which in their eyes means that they favor other things over unbiasedness and estimates of effects. The most striking example of their philosophy (in my opinion) comes when Andreas Mueller plainly asks why someone would want an unbiased implementation of logistic regression. Inference simply isn't on their radar.

Hence, LogisticRegression is not the de jure Logistic Regression. It is a penalized variant thereof by default (and the default penalty doesn't even make any sense). There is another sharp point. Had you learned about penalized logistic regression a la ridge regression or the LASSO, you would be surprised to learn sklearn parameterizes the penalty parameter as the inverse of the regularization strength. Hence setting $\lambda=2$ in LASSO or Ridge would correspond to C=0.5 in LogisticRegression.

Let me sum up by making this completely unambiguous.

If you are intent on estimating the effects of some covariates on a binary outcome, and you insist on using python Do Not Use Sklearn. Use Statsmodels.

If however, you insist on using sklearn, remember that you need to set penalty='none' in the model instantiation step. Else, your estimates will be biased towards the null (by design).

Option 1: Cholesky

The Gram matrix $K$, where $K_{ij} = k(x_i, x_j)$, is all the data we need to run a method. But $K$ is necessarily positive semi-definite, and for the ANOVA kernel, I believe that it is always nonsingular as well. Thus there is some matrix $R$ such that $K = R R^T$, i.e. $K$ is the Gram matrix of $R$. So you can compute $K$ yourself, and then use $R$ as the features.

This approach is exact, but the logistic regression model will be slow, since the dimensionality is equal to the number of inputs, and it can't be extended to new points easily (so you need to know what the test points are at training time).

import numpy as np
from sklearn.metrics.pairwise import check_pairwise_arrays

def anova_kernel(X, Y=None, gamma=None, p=1):
    X, Y = check_pairwise_arrays(X, Y)
    if gamma is None:
        gamma = 1. / X.shape[1]

    diff = X[:, None, :] - Y[None, :, :]
    diff **= 2
    diff *= -gamma
    np.exp(diff, out=diff)
    K = diff.sum(axis=2)
    K **= p
    return K


from scipy.linalg import cholesky
from sklearn.linear_model import LogisticRegression

# make X the matrix of all data points, train + test
# train is indices of train points, test of test
K = anova_kernel(X)
R = cholesky(K, lower=False)
clf = LogisticRegression()
clf.fit(R[train], y_train)
preds = clf.predict(R[test])

I don't recommend this; it's just here for completeness.

Note in the anova_kernel function above that it doesn't do the thing rapidminer does in "double-counting" points where they're both zero. I don't know why it does that; none of the other definitions of the anova kernel I've seen do.

Option 2: Nyström

This is an approximate kernel embedding which works by assuming that the kernel matrix $K$ is low-rank – which it almost always is. In Nyström, you pick a set of $D$ data points, then compute the full kernel evaluation of all the $n$ points to only those $D$ points. If you pick the first $D$ points, then the kernel matrix looks like:

$$ \begin{bmatrix} A & B \\ B^T & ? \end{bmatrix} $$ where you've computed $A$ (the pairwise kernel evaluations among the $D$ points) and $B$ (the kernel evaluations from each of the other points to the $D$ points), and $?$ will be guessed based on the assumption that $K$ is rank $D$. It turns out that the way to do that is to eigendecompose $A = U \Lambda U^T$, and then use the features $Z = \begin{bmatrix}U \Lambda^{\tfrac12} \\ B^T U \Lambda^\dagger \Lambda^{\tfrac12}\end{bmatrix}$.

Compared to Cholesky, this approach is inexact, but gives a low-dimensional embedding, and easily allows extension to new data points. Plus, it's already implemented in sklearn.kernel_approximation.Nystroem.

from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

K_train = anova_kernel(X_train)
clf = Pipeline([
    ('nys', Nystroem(kernel='precomputed', n_components=100)),
    ('lr', LogisticRegression())
])
clf.fit(K_train, y_train)

K_test = anova_kernel(X_test, X_train)
preds = clf.predict(K_test)

Option 3: Random Fourier Features

The other major paradigm of approximate kernel embeddings is random Fourier features (also sometimes called "random kitchen sinks"). These work on shift-invariant kernels $k(x, y) = k(x - y)$ by taking the Fourier transform of $k$, which by Bochner's theorem will be (scalable to) a probability distribution, sampling $\{\omega_i\}_{i=1}^{D/2}$ from it, and then using the feature map $$z(x) = \begin{bmatrix}\sin(\omega_1^T x) & \cos(\omega_1^T x) & \dots & \sin(\omega_{D/2}^T x) & \cos(\omega_{D/2}^T x)\end{bmatrix}.$$ (Sometimes you'll see this as $\cos(\omega_i^T X + b_i)$ instead, but this way is better.)

scikit-learn has these implemented for RBF and some variants of $\chi^2$ kernels, but not for ANOVA. But the kernel is shift-invariant:

$$ k(\Delta) = \left( \sum_{i=1}^d \exp(-\gamma \Delta_i^2) \right)^p $$

The problem is, I don't know how to Fourier transform it. :) For $d = 1$, it's a Gaussian with mean 0 and variance $2 p \gamma$. For $p = 1$, $d > 1$, you can do it by getting an embedding for each dimension and concatenating them together.

For integer $p$, you might be able to turn the product of sums into a sum of products and then Fourier transform each term, I suppose. Don't think that'd generalize to non-integral $p$, and I haven't tried it for integral $p > 1$ yet either.

Logistic Regression – Determining Number of Coefficients and Intercepts in Sklearn

The library creates $c$ neurons for $c$ classes for $c>2$, which yields $c\times f$ coefficients and $c$ biases, where $f$ is the number of features. So, it's like a collection of logistic regressions, or like a single layer neural network.

Best Answer

Related Solutions

Solved – Logistic regression using ANOVA kernel in SKLearn

Option 1: Cholesky

Option 2: Nyström

Option 3: Random Fourier Features

Logistic Regression – Determining Number of Coefficients and Intercepts in Sklearn

Related Question