[Math] Understanding partial derivative of logistic regression cost function

calculus, derivatives, logistic-regression, partial-derivative

I'm following along in Andrew Ng's great lecture series on machine learning, and he presents the following as the cost function for a logistic regression model [link]:

$$L(a,y) = -(y \log(a) + (1 - y) \log(1 - a))$$

He then builds a small computation graph, a series of equations that can be used as helpers for computing the partial derivatives of $L$ with respect to various variables [link]:

$$ z = w_1x_1 + w_2x_2 + b $$
$$ \hat{y} = a = \sigma(z) $$

Next, he says that the following is the partial derivative of $L$ with respect to $a$ [link]:

$$ \frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} $$

Unfortunately, he doesn't give any clues as to how this can be derived. Does anyone here know how to derive this partial derivative given the equations above? I'd be very grateful for any insights others can offer on this question!

Best Answer

If your equation is

$$L(a,y)=-\left(y\log(a)+(1-y)\log(1-a)\right)$$

then, differentiating with respect to $a$, we get

$$\frac{\partial L(a,y)}{\partial a}=-\left(\frac{y}{a}+\frac{1-y}{1-a}\cdot (-1)\right)$$

which simplifies to

$$-\frac{y}{a} + \frac{1-y}{1-a}$$

since

$$(\log(a))'=\frac{1}{a}$$

and, by the chain rule,

$$(\log(1-a))'=\frac{1}{1-a}\cdot (-1).$$