Solved – What is scikit-learn’s LogisticRegression minimizing?

logistic, maximum likelihood, scikit-learn

I have used linear_model.LogisticRegression for a classification problem with L1 regularization, and my first tests were very satisfactory. However, my earlier tests were done in R using the glmnet package, and now I would like to understand why the results differ.

I am trying to derive by hand the objective function minimized by linear_model.LogisticRegression. According to the documentation, the function is $||w||_1 + C \sum_i \log(\exp(-y_i (X_i^T w + c)) + 1)$. However, I am failing to obtain this equation.
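For concreteness, here is a minimal sketch of how that documented objective could be evaluated for a fitted model; the toy data and variable names are mine, purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: X is (m, N), y in {0, 1}
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.3 > 0).astype(int)

C = 1.0
clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
clf.fit(X, y)

w, c = clf.coef_.ravel(), clf.intercept_[0]
y_pm = 2 * y - 1  # map {0, 1} -> {-1, +1}, as liblinear does internally

# Objective as written in the docs: ||w||_1 + C * sum_i log(exp(-y_i (X_i^T w + c)) + 1)
obj = np.abs(w).sum() + C * np.log1p(np.exp(-y_pm * (X @ w + c))).sum()
print(obj)
```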

The starting point of my derivation is the negative log-likelihood, which I have tried to work out in three different ways.

One derivation follows Andrew Ng's approach, which minimizes the negative log-likelihood $-\sum_i \left[ y_i \log p(x_i) + (1-y_i) \log(1-p(x_i)) \right]$ (note that here $y\in\{0,1\}$). The second approach follows closely this lecture (http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf , section 12.2.1), which results in an equation similar to the one used by scikit-learn, but not quite the same: $-\sum_i \log(1 + \exp(X_i^Tw+c)) + \sum_i y_i(X_i^Tw+c) + (\text{L1 term})$ (adapted to scikit-learn notation from equation 12.10). Close, but no cigar!

The third approach follows http://people.csail.mit.edu/jrennie/writing/lr.pdf, where $y\in\{1,-1\}$, as scikit-learn does internally, and the objective function is $-\sum_i \log(y_i g(X_i^Tw + c)) + (\text{L1 term})$, where $g(z) = \frac{1}{1+e^{-z}}$ is the logistic function (adapted from equation 7). Here the objective function resembles scikit-learn's, but the exponential term is inverted and the $y$ is not inside the exponential.
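To see numerically that these formulations describe the same quantity, here is a quick check with arbitrary data and parameters, chosen only for illustration (the Shalizi-style log-likelihood is negated so that everything is stated as a minimization):

```python
import numpy as np

rng = np.random.RandomState(1)
m, N = 50, 4
X = rng.randn(m, N)
y01 = rng.randint(0, 2, size=m)        # labels in {0, 1}
ypm = 2 * y01 - 1                      # same labels in {-1, +1}
w, c = rng.randn(N), 0.2               # arbitrary parameters
z = X @ w + c
p = 1.0 / (1.0 + np.exp(-z))           # g(X_i^T w + c)

# Ng-style negative log-likelihood with y in {0, 1}
nll_01 = -np.sum(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

# Shalizi eq. 12.10 style, negated to turn the maximization into a minimization
nll_shalizi = np.sum(np.log1p(np.exp(z))) - np.sum(y01 * z)

# scikit-learn docs form with y in {-1, +1}
nll_pm = np.sum(np.log1p(np.exp(-ypm * z)))

print(np.allclose(nll_01, nll_pm), np.allclose(nll_shalizi, nll_pm))
```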

So the question is: what exactly is scikit-learn's L1-regularized logistic regression minimizing?

If it really is the likelihood, then my derivation is probably wrong at some point. Is there a reference that explains the equation minimized according to the docs (and according to sklearn/svm/src/liblinear/linear.cpp)?

If it's not the likelihood, what is it, and where can I find more information?

Best Answer

Never mind my question: I did the derivation once again and found that the scikit-learn equation is correct, and that it is indeed minimizing the negative log-likelihood. Here are the steps:

Let $(X_i,y_i), i=1,\dots,m$ be pairs of (features, class), where $X_i$ is a column-vector of $N$ features. The class $y_i\in\{1,-1\}$ will be limited to these two values (instead of 0 and 1), which will be useful later.

We are trying to model the probability that a feature vector $X$ belongs to class $y=1$ as: $$p(y=1|X;w,c) = g(X^Tw+c) = \frac{1}{1+\exp(-(X^Tw+c))}\,,$$ where $w,c$ are the weights and intercept of the logistic regression model.

To obtain the optimal $w,c$, we want to maximize the likelihood over the dataset of labeled examples. The optimization problem is: $$\begin{align} \mathop{argmax}_{w,c}\quad& \mathcal{L}(w,c;X_1,\dots,X_m) \\ &= \prod_{i,y_i=1} p(y=1|X_i;w,c) \prod_{i,y_i=-1} p(y=-1|X_i;w,c) \\ &\langle \text{There are only two classes, so } p(y=-1|\dots) = 1-p(y=1|\dots)\rangle\\ &= \prod_{i,y_i=1} p(y=1|X_i;w,c) \prod_{i,y_i=-1} (1-p(y=1|X_i;w,c)) \\ &\langle \text{Definition of } p\rangle\\ &= \prod_{i,y_i=1} g(X_i^Tw +c) \prod_{i,y_i=-1} (1-g(X_i^Tw +c)) \\ &\langle \text{Useful property: } 1-g(z) = g(-z) \rangle\\ &= \prod_{i,y_i=1} g(X_i^Tw +c) \prod_{i,y_i=-1} g(-(X_i^Tw +c)) \\ &\langle \text{Handy trick of using +1/-1 classes: multiply by } y_i \text{ to have a common product}\rangle\\ &= \prod_{i=1}^m g(y_i (X_i^Tw +c)) \\ \end{align}$$
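As an aside (my own check, not part of the original derivation), the "+1/-1 trick" in the last step is easy to verify numerically with arbitrary values:

```python
import numpy as np

rng = np.random.RandomState(2)
z = rng.randn(20)                      # z_i = X_i^T w + c, arbitrary values
y = rng.choice([-1, 1], size=20)
g = lambda t: 1.0 / (1.0 + np.exp(-t))

# Likelihood written per class ...
lik_split = np.prod(g(z[y == 1])) * np.prod(g(-z[y == -1]))
# ... and collapsed into a single product using the +1/-1 trick
lik_joint = np.prod(g(y * z))

print(np.allclose(lik_split, lik_joint))
```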

Next, I apply the logarithm to this product (since it is monotonically increasing) and turn the maximization into a minimization by multiplying by $-1$:

$$\begin{align} \mathop{argmin}_{w,c}\quad & -\log \left(\prod_{i=1}^m g(y_i (X_i^Tw +c))\right) \\ &\langle \log (a\cdot b) = \log a + \log b \rangle\\ &= -\sum_{i=1}^{m} \log g(y_i (X_i^Tw +c)) \\ &\langle \text{definition of } g \rangle\\ &= -\sum_{i=1}^{m} \log \frac{1}{1+\exp(-y_i (X_i^Tw +c))} \\ &\langle \log (a/b) = \log a - \log b \rangle\\ &= -\sum_{i=1}^{m} \left[\log 1 - \log (1+\exp(-y_i (X_i^Tw +c)))\right] \\ &\langle \log 1 = 0\text{, so it vanishes} \rangle\\ &= \sum_{i=1}^{m} \log (\exp(-y_i (X_i^Tw +c))+1)\,, \end{align}$$ which is exactly the equation minimized by scikit-learn (without the L1 regularization term, and with $C=1$).
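As a final sanity check (again my own, with arbitrary values), the derived negative log-likelihood plus the L1 term matches the documented objective numerically once it is scaled by $C$:

```python
import numpy as np

rng = np.random.RandomState(3)
m, N = 100, 5
X = rng.randn(m, N)
y = rng.choice([-1, 1], size=m)
w, c, C = rng.randn(N), -0.4, 0.5
z = X @ w + c
g = lambda t: 1.0 / (1.0 + np.exp(-t))

# Negative log-likelihood as derived above, plus the L1 term, scaled as in the docs
derived = np.abs(w).sum() + C * (-np.sum(np.log(g(y * z))))
# Objective as written in the scikit-learn documentation
docs = np.abs(w).sum() + C * np.sum(np.log1p(np.exp(-y * z)))

print(np.allclose(derived, docs))
```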
