Solved – How to get cost function of logistic regression in Scikit Learn from log likelihood function

classification, logistic, maximum likelihood, regularization

The log likelihood function of logistic regression is:

\begin{align}
\ln(L(x, y; w))
&=\sum\limits_{i=1}^n [y_i\ln(p(x_i;w))+(1-y_i)\ln(1-p(x_i;w))]
\\&=\sum\limits_{i=1}^n [y_i\ln(\dfrac{1}{1+e^{-w'x_i}})+(1-y_i)\ln(1-\dfrac{1}{1+e^{-w'x_i}})]
\end{align}
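For concreteness, here is a minimal NumPy sketch that evaluates this expression directly; the synthetic data and the names `log_likelihood`, `X`, `y`, `w` are mine, not from the question, and the code favours readability over numerical robustness.

```python
import numpy as np

def log_likelihood(w, X, y):
    """Bernoulli log likelihood of logistic regression.

    X: (n, d) feature matrix, y: (n,) labels in {0, 1}, w: (d,) weights.
    Mirrors the formula above; a production version would use a numerically
    stable log-sigmoid instead of log(p) and log(1 - p).
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w)))            # p(x_i; w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# tiny made-up example
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, 0, 1, 1, 0])
w = rng.normal(size=3)
print(log_likelihood(w, X, y))
```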


The cost function including the penalty term with $L_1$/$L_2$ regularization is given in the scikit-learn documentation; for the $L_2$ penalty (with $y_i \in \{-1, 1\}$) it is:

$$ \min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^n \log\left( \exp\left( - y_i (w^T x_i + c) \right) + 1 \right) $$


I understand $C$ and the $L_1$/$L_2$ norms, but I cannot derive this cost function from the log likelihood. Can anyone help with the derivation?

Best Answer

Your log-likelihood is:
$$ \log L(x, y; w) = \sum_{i=1}^n \ell_i $$
where
\begin{align}
\ell_i &= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( 1 - \frac{1}{1 + \exp(- w^T x_i)} \right)
\\&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{1 + \exp(- w^T x_i)}{1 + \exp(- w^T x_i)} - \frac{1}{1 + \exp(- w^T x_i)} \right)
\\&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{\exp(- w^T x_i)}{1 + \exp(- w^T x_i)} \right)
\\&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{\exp(- w^T x_i)}{1 + \exp(- w^T x_i)} \times \frac{\exp(w^T x_i)}{\exp(w^T x_i)} \right)
\\&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{1}{\exp(w^T x_i) + 1} \right)
\\&= \log\left( \frac{1}{1 + \exp\left( \begin{cases}- w^T x_i & y_i = 1 \\ w^T x_i & y_i = 0\end{cases} \right)} \right)
\\&= \log\left( \frac{1}{1 + \exp\left( - y'_i w^T x_i \right)} \right)
\\&= -\log\left( 1 + \exp\left( - y_i' w^T x_i \right) \right)
\end{align}
where $y_i \in \{0, 1\}$ but we defined $y_i' := 2 y_i - 1 \in \{-1, 1\}$, so that $y_i' = 1$ when $y_i = 1$ and $y_i' = -1$ when $y_i = 0$.
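Since the algebra is easy to mistype, here is a quick numerical sanity check (synthetic data, my own variable names) that the Bernoulli form with $y_i \in \{0, 1\}$ and the logistic-loss form with $y_i' = 2y_i - 1$ agree term by term:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
y = rng.integers(0, 2, size=6)       # labels in {0, 1}
w = rng.normal(size=3)

z = X @ w                            # w'x_i for each i
p = 1.0 / (1.0 + np.exp(-z))         # p(x_i; w)

# Bernoulli form with y_i in {0, 1}
lhs = y * np.log(p) + (1 - y) * np.log(1 - p)

# logistic-loss form with y_i' = 2*y_i - 1 in {-1, 1}
y_signed = 2 * y - 1
rhs = -np.log1p(np.exp(-y_signed * z))

print(np.allclose(lhs, rhs))         # True
```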

To get to the loss function in the image, first we need to add an intercept to the model, replacing $w^T x_i$ with $w^T x_i + c$. Then:
$$ \arg\max \log L(X, y; w, c) = \arg\min - \log L(X, y; w, c) ,$$
and then we add a regularizer $P(w, c)$:
$$ \arg\min \lambda P(w, c) - \log L(X, y; w, c) = \arg\min P(w, c) - \frac{1}{\lambda} \log L(X, y; w, c) ,$$
where we then set $C := \frac1\lambda$. Substituting $-\log L$ from above, the objective being minimized is
$$ P(w, c) + C \sum_{i=1}^n \log\left( 1 + \exp\left( - y_i' (w^T x_i + c) \right) \right) ,$$
which is exactly the cost function shown. The $L_2$ penalty is
$$ P(w, c) = \frac12 w^T w = \frac12 \sum_{j=1}^d w_j^2 ;$$
the $\tfrac12$ is just there for mathematical convenience when we differentiate and doesn't really affect anything. The $L_1$ penalty is
$$ P(w, c) = \lVert w \rVert_1 = \sum_{j=1}^d \lvert w_j \rvert .$$
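As a rough end-to-end check (all data and names below are made up for illustration), one can minimize $\frac12 w^T w + C \sum_i \log\left(1 + \exp(-y_i'(w^T x_i + c))\right)$ directly with `scipy.optimize.minimize` and compare against `sklearn.linear_model.LogisticRegression` with the same `C`; the two solutions should match up to solver tolerance:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.normal(size=200) > 0).astype(int)
y_signed = 2 * y - 1                      # map {0, 1} -> {-1, 1}
C = 0.7                                   # C = 1 / lambda

def objective(params):
    w, c = params[:-1], params[-1]
    z = y_signed * (X @ w + c)
    # 0.5 * w'w + C * sum_i log(1 + exp(-y_i' (w'x_i + c)))
    return 0.5 * (w @ w) + C * np.sum(np.log1p(np.exp(-z)))

res = minimize(objective, np.zeros(X.shape[1] + 1), method="BFGS")

clf = LogisticRegression(penalty="l2", C=C, solver="lbfgs").fit(X, y)

print(res.x[:-1], res.x[-1])              # hand-rolled minimizer
print(clf.coef_.ravel(), clf.intercept_)  # scikit-learn, should be close
```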