Nevermind my question, I did the derivation once again and found out that the scikit equation is correct, and it is minimizing the negative log likelihood.
Here are the steps:
Let $(X_i,y_i), i=1,\dots,m$ be pairs of (features, class), where $X_i$ is a column vector of $N$ features. The class labels are taken as $y_i\in\{1,-1\}$ (instead of 0 and 1), which will be useful later.
We are trying to model the probability that a feature vector $X_i$ belongs to class $y=1$ as:
$$p(y=1|X_i;w,c) = g(X_i^Tw+c) = \frac{1}{1+\exp(-(X_i^Tw+c))}\,,$$
where $w,c$ are the weights and intercept of the logistic regression model.
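As a small sketch (not part of the original post), the model above can be written in plain Python; the names `sigmoid` and `predict_proba` are my own:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, c):
    """Model p(y=1 | x; w, c) = g(x^T w + c) for a feature list x."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + c
    return sigmoid(z)
```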
To obtain the optimal $w,c$, we want to maximize the likelihood given the labeled dataset. The optimization problem is:
$$\begin{align} \mathop{argmax}_{w,c}\quad& \mathcal{L}(w,c;X_1,\dots,X_m) \\
&= \prod_{i,y_i=1} p(y=1|X_i;w,c) \prod_{i,y_i=-1} p(y=-1|X_i;w,c) \\
&\langle \text{There are only two classes, so } p(y=-1|\dots) = 1-p(y=1|\dots)\rangle\\
&= \prod_{i,y_i=1} p(y=1|X_i;w,c) \prod_{i,y_i=-1} (1-p(y=1|X_i;w,c)) \\
&\langle \text{Definition of } p\rangle\\
&= \prod_{i,y_i=1} g(X_i^Tw +c) \prod_{i,y_i=-1} (1-g(X_i^Tw +c)) \\
&\langle \text{Useful property: } 1-g(z) = g(-z) \rangle\\
&= \prod_{i,y_i=1} g(X_i^Tw +c) \prod_{i,y_i=-1} g(-(X_i^Tw +c)) \\
&\langle \text{Handy trick of using +1/-1 classes: multiply by } y_i \text{ to have a common product}\rangle\\
&= \prod_{i=1}^m g(y_i (X_i^Tw +c)) \\
\end{align}$$
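A quick numeric sanity check of the $+1/-1$ trick used in the last step (my own sketch, not from the original post): $g(y_i z_i)$ equals $p(y=1|\dots)$ when $y_i=+1$ and $1-p(y=1|\dots)$ when $y_i=-1$, so one common factor covers both products.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7
# y = +1: g(y*z) is just g(z), i.e. p(y=1|...)
assert abs(sigmoid(+1 * z) - sigmoid(z)) < 1e-12
# y = -1: g(y*z) = g(-z) = 1 - g(z), i.e. p(y=-1|...)
assert abs(sigmoid(-1 * z) - (1.0 - sigmoid(z))) < 1e-12
```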
At this point I decided to apply the logarithm (since it is monotonically increasing) and flip the maximization into a minimization by multiplying by $-1$:
$$\begin{align}
\mathop{argmin}_{w,c}\quad & -\log \left(\prod_{i=1}^m g(y_i (X_i^Tw +c))\right) \\
&\langle \log (a\cdot b) = \log a + \log b \rangle\\
&= -\sum_{i=1}^{m} \log g(y_i (X_i^Tw +c)) \\
&\langle \text{definition of } g \rangle\\
&= -\sum_{i=1}^{m} \log \frac{1}{1+\exp(-y_i (X_i^Tw +c))} \\
&\langle \log (a/b) = \log a - \log b \rangle\\
&= -\sum_{i=1}^{m} \left( \log 1 - \log (1+\exp(-y_i (X_i^Tw +c))) \right) \\
&\langle \log 1 = 0 \rangle\\
&= -\sum_{i=1}^{m} - \log (1+\exp(-y_i (X_i^Tw +c))) \\
&= \sum_{i=1}^{m} \log (\exp(-y_i (X_i^Tw +c))+1)\,,
\end{align}$$
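The final objective is easy to evaluate directly. Here is a minimal Python sketch (the name `neg_log_likelihood` is mine; `math.log1p` computes $\log(1+x)$ accurately for small $x$):

```python
import math

def neg_log_likelihood(X, y, w, c):
    """Sum_i log(1 + exp(-y_i (X_i^T w + c))), with y_i in {+1, -1}."""
    total = 0.0
    for x, yi in zip(X, y):
        z = sum(xj * wj for xj, wj in zip(x, w)) + c
        total += math.log1p(math.exp(-yi * z))
    return total
```

By construction this equals $-\sum_i \log g(y_i(X_i^Tw+c))$, the negative log-likelihood from the start of the derivation.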
which is exactly the equation minimized by scikit-learn (without the L1 regularization term, and with $C=1$).
Best Answer
These two are actually (almost) equivalent because of the following property of the logistic function:
$$ \sigma(x) = \frac{1}{1+\exp(-x)} = \frac{\exp(x)}{\exp(x)+1} $$
Also
$$ \sum_{i=1}^n \log ( 1 + \exp( -y_i (X_i^T w + c) ) ) \\ = \sum_{i=1}^n \log \left[ (\exp( y_i (X_i^T w + c) ) + 1) \exp( -y_i (X_i^T w + c) ) \right] \\ = -\sum_{i=1}^n \left[ y_i (X_i^T w + c) - \log (\exp( y_i (X_i^T w + c) ) + 1) \right] $$
Note, though, that your formula doesn't have $y_i$ in the "log part", while this one does. (I guess this is a typo)
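A quick numeric check of the identity above (my own sketch): $\log(1+\exp(-z)) = -\left(z - \log(\exp(z)+1)\right)$ for any $z$.

```python
import math

for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    lhs = math.log(1.0 + math.exp(-z))
    rhs = -(z - math.log(math.exp(z) + 1.0))
    assert abs(lhs - rhs) < 1e-12
```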