Never mind my question; I did the derivation once again and found that the scikit-learn equation is correct: it is minimizing the negative log-likelihood.
Here are the steps:
Let $(X_i,y_i),\ i=1,\dots,m$ be pairs of (features, class), where $X_i$ is a column vector of $N$ features. The class $y_i\in\{1,-1\}$ is restricted to these two values (instead of 0 and 1), which will be useful later.
We model the probability that a feature vector $X$ belongs to class $y=1$ as:
$$p(y=1|X;w,c) = g(X^Tw+c) = \frac{1}{1+\exp(-(X^Tw+c))}\,,$$
where $w,c$ are the weights and intercept of the logistic regression model.
To obtain the optimal $w,c$, we maximize the likelihood over the labeled dataset. The optimization problem is:
$$\begin{align} \mathop{argmax}_{w,c}\quad& \mathcal{L}(w,c;X_1,\dots,X_m) \\
&= \prod_{i,y_i=1} p(y=1|X_i;w,c) \prod_{i,y_i=-1} p(y=-1|X_i;w,c) \\
&\langle \text{There are only two classes, so } p(y=-1|\dots) = 1-p(y=1|\dots)\rangle\\
&= \prod_{i,y_i=1} p(y=1|X_i;w,c) \prod_{i,y_i=-1} (1-p(y=1|X_i;w,c)) \\
&\langle \text{Definition of } p\rangle\\
&= \prod_{i,y_i=1} g(X_i^Tw +c) \prod_{i,y_i=-1} (1-g(X_i^Tw +c)) \\
&\langle \text{Useful property: } 1-g(z) = g(-z) \rangle\\
&= \prod_{i,y_i=1} g(X_i^Tw +c) \prod_{i,y_i=-1} g(-(X_i^Tw +c)) \\
&\langle \text{Handy trick of using +1/-1 classes: multiplying the argument by } y_i \text{ merges both products into one}\rangle\\
&= \prod_{i=1}^m g(y_i (X_i^Tw +c)) \\
\end{align}$$
At this point I apply the logarithm (since it is monotonically increasing) and flip the maximization into a minimization by multiplying by $-1$:
$$\begin{align}
\mathop{argmin}_{w,c}\quad & -\log \left(\prod_{i=1}^m g(y_i (X_i^Tw +c))\right) \\
&\langle \log (a\cdot b) = \log a + \log b \rangle\\
&= -\sum_{i=1}^{m} \log g(y_i (X_i^Tw +c)) \\
&\langle \text{definition of } g \rangle\\
&= -\sum_{i=1}^{m} \log \frac{1}{1+\exp(-y_i (X_i^Tw +c))} \\
&\langle \log (a/b) = \log a - \log b \rangle\\
&= -\sum_{i=1}^{m} \left(\log 1 - \log (1+\exp(-y_i (X_i^Tw +c)))\right) \\
&\langle \log 1 = 0 \rangle\\
&= -\sum_{i=1}^{m} - \log (1+\exp(-y_i (X_i^Tw +c))) \\
&= \sum_{i=1}^{m} \log (\exp(-y_i (X_i^Tw +c))+1)\,,
\end{align}$$
which is exactly the equation minimized by scikit-learn (without the L1 regularization term, and with $C=1$).
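As a quick numerical sanity check (a sketch I am adding, not part of the original answer; the dataset is illustrative), one can minimize the objective derived above directly and compare against scikit-learn with regularization effectively switched off:

import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y01 = make_classification(n_samples=200, n_features=4, random_state=0)
y = 2 * y01 - 1  # recode the classes from {0, 1} to {+1, -1} as in the derivation

def nll(params):
    w, c = params[:-1], params[-1]
    # sum_i log(1 + exp(-y_i (X_i^T w + c))), computed stably via logaddexp
    return np.sum(np.logaddexp(0.0, -y * (X @ w + c)))

res = minimize(nll, np.zeros(X.shape[1] + 1))                # minimize the derived objective
clf = LogisticRegression(C=1e10, max_iter=1000).fit(X, y01)  # huge C ~ (almost) no regularization

print(np.round(res.x[:-1], 3))    # weights from the derived objective
print(np.round(clf.coef_[0], 3))  # scikit-learn's weights; should match closely

(This assumes the data are not linearly separable, so that the unregularized optimum is finite.)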
A lasso solution $\widehat{\beta}(\lambda)$ solves
$$\min_\beta \frac{1}{2}||y-X\beta||_2^2 +\lambda||\beta||_1\,,$$
and it is well known that $\widehat{\beta}(\lambda)=0$ for all $\lambda \geq \lambda_1$, where $\lambda_1 = \max_j |X_j^Ty|$; this should give you the desired value.
Note that $\lambda_1$ may need a different scaling if the objective function is scaled differently.
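A short justification, as a sketch from the subgradient (KKT) condition at $\beta=0$ (not spelled out above): $\beta=0$ is optimal iff
$$0 \in -X^T(y-X\cdot 0) + \lambda\,\partial||\beta||_1\big|_{\beta=0}\,, \quad\text{i.e.}\quad |X_j^Ty| \le \lambda \ \text{for all } j,$$
which holds exactly when $\lambda \ge \max_j|X_j^Ty| = \lambda_1$.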
Using the cars example with GLMNET:
library(glmnet)  # needed for glmnet()
fit <- glmnet(as.matrix(mtcars[, -1]), mtcars[, 1], intercept = FALSE, standardize = FALSE)
# glmnet scales the squared-error loss by 1/n (mtcars has 32 rows), hence the factor 1/32
1/32 * max(abs(t(as.matrix(mtcars[, -1])) %*% mtcars[, 1])) / (head(fit$lambda))[1]
This gives the value 1, as expected.
Note that standardize as well as intercept are set to FALSE. If standardize and intercept are set to TRUE, then the value of $\lambda$ is calculated on the scaled regressors. (In this regard, take a look at https://think-lab.github.io/d/205/#5 for how to perform proper scaling to get the results you want.)
xy <- scale(mtcars)  # standardize all columns (sd uses the n-1 denominator)
fit <- glmnet(as.matrix(mtcars[, -1]), mtcars[, 1])  # defaults: standardize = TRUE, intercept = TRUE
# sqrt(32/31) converts scale()'s n-1 standardization to glmnet's internal 1/n standardization
(1/32 * max(abs(t(xy[, -1]) %*% mtcars[, 1] * sqrt(32/31)))) / (head(fit$lambda))[1]
This once again gives the value 1.
However, I am not sure what glmnet calculates if intercept = TRUE but standardize = FALSE.
We saw that glmnet with its standard options calculates $\lambda_{1}$ as $$\lambda_{1} = \max_j \left| \frac{1}{n} \sum_{i=1}^n x_{ij}^* y_i \right|,$$ where $x_{ij}^* = \frac{x_{ij}-\overline{x_j}}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij}-\overline{x_j})^2}}.$
It turns out that for an elastic net problem (corresponding to $\alpha \in (0,1]$ in glmnet) the maximum value $\lambda_{1,\alpha}$ is calculated as
$$\lambda_{1,\alpha}= \lambda_{1}/\alpha\,.$$
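A brief justification (a sketch I am adding, via the same subgradient argument as above): at $\beta=0$ the ridge part of the penalty has zero gradient, so only the $\ell_1$ part is active, and $\beta=0$ is optimal iff
$$\left|\frac{1}{n}\sum_{i=1}^n x_{ij}^* y_i\right| \le \lambda\alpha \quad\text{for all } j\,,$$
which gives $\lambda_{1,\alpha} = \lambda_1/\alpha$.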
Indeed, setting for example $\alpha=0.3$, we have:
aa <- 0.3
xy <- scale(mtcars)
fit <- glmnet(as.matrix(mtcars[, -1]), mtcars[, 1], alpha = aa)  # elastic net mixing parameter
1/aa * (1/32 * max(abs(t(xy[, -1]) %*% mtcars[, 1] * sqrt(32/31)))) / (head(fit$lambda))[1]
which results once again in an output value of $1$.
That is it for the calculations. Note, however, that the elastic net criterion can be rewritten as a standard lasso problem.
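One common way to do this (a sketch I am adding, writing $\gamma_1=\lambda\alpha$ and $\gamma_2=\lambda(1-\alpha)$ to avoid clashing with the $\lambda_1$ above): the ridge term can be absorbed into the squared loss by augmenting the data,
$$\frac{1}{2}||y-X\beta||_2^2 + \gamma_1||\beta||_1 + \frac{\gamma_2}{2}||\beta||_2^2 = \frac{1}{2}\left|\left|\begin{pmatrix}y\\0\end{pmatrix}-\begin{pmatrix}X\\ \sqrt{\gamma_2}\,I\end{pmatrix}\beta\right|\right|_2^2 + \gamma_1||\beta||_1\,,$$
so the elastic net is a standard lasso on the augmented response $\tilde{y}=(y^T, 0^T)^T$ and design $\tilde{X}=(X^T, \sqrt{\gamma_2}\,I)^T$.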
Best Answer
You can use the elastic-net penalty in sklearn's LogisticRegression classifier (LogisticRegressionCV accepts the same parameters as well):
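A minimal sketch (the dataset is illustrative; penalty='elasticnet' requires the 'saga' solver and scikit-learn >= 0.21):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# l1_ratio plays the role of glmnet's alpha: 0 = pure L2, 1 = pure L1
clf = LogisticRegression(penalty='elasticnet', solver='saga',
                         l1_ratio=0.5, C=1.0, max_iter=5000).fit(X, y)

# LogisticRegressionCV cross-validates C and l1_ratio jointly
cv = LogisticRegressionCV(penalty='elasticnet', solver='saga',
                          l1_ratios=[0.1, 0.5, 0.9], Cs=5,
                          max_iter=5000).fit(X, y)
print(clf.coef_)
print(cv.C_, cv.l1_ratio_)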