Never mind my question; I did the derivation again and found that the scikit-learn equation is correct, and it is indeed minimizing the negative log-likelihood.
Here are the steps:
Let $(X_i,y_i), i=1,\dots,m$ be pairs of (features, class), where $X_i$ is a column-vector of $N$ features. The class $y_i\in\{1,-1\}$ will be limited to these two values (instead of 0 and 1), which will be useful later.
We are trying to model the probability that a feature vector $X$ belongs to class $y=1$ as:
$$p(y=1|X;w,c) = g(X^Tw+c) = \frac{1}{1+\exp(-(X^Tw+c))}\,,$$
where $w,c$ are the weights and intercept of the logistic regression model.
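For a quick sanity check, here is a minimal NumPy sketch of this model; the names `g` and `p_y1` and the toy values of `X`, `w`, `c` are just illustrative, and the rows of `X` play the role of the $X_i^T$:

```python
import numpy as np

def g(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1(X, w, c):
    """Model probability p(y=1 | X; w, c) for each row of X."""
    return g(X @ w + c)

# toy example: 3 samples, 2 features (rows of X are the X_i^T)
X = np.array([[0.5, -1.2], [1.0, 0.3], [-0.7, 2.1]])
w = np.array([0.8, -0.4])
c = 0.1
print(p_y1(X, w, c))  # three probabilities in (0, 1)
```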
To obtain the optimal $w,c$, we want to maximize the likelihood over the dataset of labeled examples. The optimization problem is:
$$\begin{align} \mathop{argmax}_{w,c}\quad& \mathcal{L}(w,c;X_1,\dots,X_m) \\
&= \prod_{i,y_i=1} p(y=1|X_i;w,c) \prod_{i,y_i=-1} p(y=-1|X_i;w,c) \\
&\langle \text{There are only two classes, so } p(y=-1|\dots) = 1-p(y=1|\dots)\rangle\\
&= \prod_{i,y_i=1} p(y=1|X_i;w,c) \prod_{i,y_i=-1} (1-p(y=1|X_i;w,c)) \\
&\langle \text{Definition of } p\rangle\\
&= \prod_{i,y_i=1} g(X_i^Tw +c) \prod_{i,y_i=-1} (1-g(X_i^Tw +c)) \\
&\langle \text{Useful property: } 1-g(z) = g(-z) \rangle\\
&= \prod_{i,y_i=1} g(X_i^Tw +c) \prod_{i,y_i=-1} g(-(X_i^Tw +c)) \\
&\langle \text{Handy trick of using +1/-1 classes: multiply by } y_i \text{ to have a common product}\rangle\\
&= \prod_{i=1}^m g(y_i (X_i^Tw +c)) \\
\end{align}$$
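To convince myself that the $+1/-1$ trick really collapses the two products into one, here is a small numeric check (the random `X`, `y`, `w`, `c` are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
X = rng.normal(size=(m, n))       # rows are the X_i^T
y = rng.choice([-1, 1], size=m)   # labels in {+1, -1}
w, c = rng.normal(size=n), 0.5

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z = X @ w + c
split = np.prod(g(z[y == 1])) * np.prod(g(-z[y == -1]))  # two separate products
combined = np.prod(g(y * z))                             # single product via the +1/-1 trick
print(np.isclose(split, combined))  # True
```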
At this point I decided to apply the logarithm (since it is monotonically increasing) and flip the maximization into a minimization by multiplying by $-1$:
$$\begin{align}
\mathop{argmin}_{w,c}\quad & -\log \left(\prod_{i=1}^m g(y_i (X_i^Tw +c))\right) \\
&\langle \log (a\cdot b) = \log a + \log b \rangle\\
&= -\sum_{i=1}^{m} \log g(y_i (X_i^Tw +c)) \\
&\langle \text{definition of } g \rangle\\
&= -\sum_{i=1}^{m} \log \frac{1}{1+\exp(-y_i (X_i^Tw +c))} \\
&\langle \log (a/b) = \log a - \log b \rangle\\
&= -\sum_{i=1}^{m} \left[\log 1 - \log (1+\exp(-y_i (X_i^Tw +c)))\right] \\
&\langle \log 1 = 0 \rangle\\
&= -\sum_{i=1}^{m} - \log (1+\exp(-y_i (X_i^Tw +c))) \\
&= \sum_{i=1}^{m} \log (\exp(-y_i (X_i^Tw +c))+1)\,,
\end{align}$$
which is exactly the equation minimized by scikit-learn (without the L1 regularization term, and with $C=1$).
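For reference, a minimal NumPy sketch of this loss; the helper name `neg_log_likelihood` is mine, and `np.logaddexp(0, -t)` is just a numerically stable way to evaluate $\log(1+\exp(-t))$:

```python
import numpy as np

def neg_log_likelihood(w, c, X, y):
    """sum_i log(1 + exp(-y_i (X_i^T w + c))), with y_i in {+1, -1}."""
    z = X @ w + c
    # logaddexp(0, -t) = log(exp(0) + exp(-t)) = log(1 + exp(-t)), stable for large |t|
    return np.sum(np.logaddexp(0.0, -y * z))
```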
These two are actually (almost) equivalent because of the following property of the logistic function:
$$
\sigma(x) = \frac{1}{1+\exp(-x)} = \frac{\exp(x)}{\exp(x)+1}
$$
Also,
$$\begin{align}
\sum_{i=1}^n \log ( 1 + \exp( -y_i (X_i^T w + c) ) )
&= \sum_{i=1}^n \log \left[ (\exp( y_i (X_i^T w + c) ) + 1) \exp( -y_i (X_i^T w + c) ) \right] \\
&= -\sum_{i=1}^n \left[ y_i (X_i^T w + c) - \log (\exp( y_i (X_i^T w + c) ) + 1) \right]
\end{align}$$
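Both identities are easy to check numerically (the grids for `x` and `t` below are arbitrary):

```python
import numpy as np

# sigma(x) = 1/(1+exp(-x)) = exp(x)/(exp(x)+1)
x = np.linspace(-5, 5, 11)
print(np.allclose(1.0 / (1.0 + np.exp(-x)), np.exp(x) / (np.exp(x) + 1.0)))  # True

# log(1 + exp(-t)) = -(t - log(exp(t) + 1)), with t = y_i (X_i^T w + c)
t = np.linspace(-5, 5, 11)
print(np.allclose(np.log1p(np.exp(-t)), -(t - np.log(np.exp(t) + 1.0))))     # True
```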
Note, though, that your formula doesn't have $y_i$ in the "log part", while this one does. (I guess this is a typo)
Best Answer
Your log-likelihood is:
$$ \log L(x, y; w) = \sum_{i=1}^N \ell_i $$
where
\begin{align}
\ell_i &= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( 1 - \frac{1}{1 + \exp(- w^T x_i)} \right) \\
&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{1 + \exp(- w^T x_i)}{1 + \exp(- w^T x_i)} - \frac{1}{1 + \exp(- w^T x_i)} \right) \\
&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{\exp(- w^T x_i)}{1 + \exp(- w^T x_i)} \right) \\
&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{\exp(- w^T x_i)}{1 + \exp(- w^T x_i)} \times \frac{\exp(w^T x_i)}{\exp(w^T x_i)} \right) \\
&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{1}{\exp(w^T x_i) + 1} \right) \\
&= \log\left( \frac{1}{1 + \exp\left( \begin{cases}- w^T x_i & y_i = 1 \\ w^T x_i & y_i = 0\end{cases} \right)} \right) \\
&= \log\left( \frac{1}{1 + \exp\left( - y'_i w^T x_i \right)} \right) \\
&= -\log\left( 1 + \exp\left( - y_i' w^T x_i \right) \right)
\end{align}
where $y_i \in \{0, 1\}$ but we defined $y_i' \in \{-1, 1\}$.
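A quick numeric check that the $\{0,1\}$ cross-entropy form of $\ell_i$ and the $\{-1,1\}$ form agree sample by sample (the random `x`, `w`, `y` are illustrative; no intercept, as in the derivation above):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 10, 4
x = rng.normal(size=(N, d))       # rows are the x_i^T
w = rng.normal(size=d)
y = rng.integers(0, 2, size=N)    # labels in {0, 1}
y_prime = 2 * y - 1               # mapped to {-1, +1}

p = 1.0 / (1.0 + np.exp(-(x @ w)))                    # p(y=1 | x_i; w)
ell_01 = y * np.log(p) + (1 - y) * np.log(1 - p)      # {0,1} cross-entropy form
ell_pm = -np.log1p(np.exp(-y_prime * (x @ w)))        # {-1,+1} form
print(np.allclose(ell_01, ell_pm))                    # True
```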
To get to the loss function in the image, first we need to add an intercept to the model, replacing $w^T x_i$ with $w^T x_i + c$. Then:
$$ \arg\max \log L(X, y; w, c) = \arg\min - \log L(X, y; w, c) ,$$
and then we add a regularizer $P(w, c)$:
$$ \arg\min \lambda P(w, c) - \log L(X, y; w, c) = \arg\min P(w, c) - \frac{1}{\lambda} \log L(X, y; w, c) ,$$
where we then set $C := \frac1\lambda$. The $L_2$ penalty is
$$ P(w, c) = \frac12 w^T w = \frac12 \sum_{j=1}^d w_j^2 ;$$
the $\tfrac12$ is there just for mathematical convenience when we differentiate; it doesn't really affect anything. The $L_1$ penalty is
$$ P(w, c) = \lVert w \rVert_1 = \sum_{j=1}^d \lvert w_j \rvert .$$
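Putting it all together, here is a sketch that minimizes the $L_2$-penalized objective $\tfrac12 w^T w + C\sum_i \log(\exp(-y_i (X_i^T w + c)) + 1)$ directly with SciPy and compares the result to `LogisticRegression`; the dataset is arbitrary, and the two solutions should only agree up to solver tolerance:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y01 = make_classification(n_samples=200, n_features=5, random_state=0)
y = 2 * y01 - 1                   # labels in {-1, +1}
C = 1.0

def objective(params):
    w, c = params[:-1], params[-1]
    penalty = 0.5 * w @ w         # L2 penalty (the intercept is not penalized)
    nll = np.sum(np.logaddexp(0.0, -y * (X @ w + c)))
    return penalty + C * nll

res = minimize(objective, np.zeros(X.shape[1] + 1), method="L-BFGS-B")

clf = LogisticRegression(penalty="l2", C=C, solver="lbfgs", max_iter=1000).fit(X, y01)
print(np.max(np.abs(res.x[:-1] - clf.coef_.ravel())))  # small difference, up to solver tolerance
print(abs(res.x[-1] - clf.intercept_[0]))              # same for the intercept
```

Both routes land on (roughly) the same $w$ and $c$, which is consistent with scikit-learn minimizing exactly this objective.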