During a lecture on logistic regression in coursera, there was a question about the cost function as shown in the following images:
I don't understand why the third unselected option is true?
logistic regressionmachine learning
During a lecture on logistic regression in coursera, there was a question about the cost function as shown in the following images:
I don't understand why the third unselected option is true?
We are given a data vector $\textbf{x}$ and a class vector $\textbf{y}$. The class vector tells us which of two classes $\{0,1\}$ the data instances belong to.
We want to come up with a function $h_\theta(x_i)$ that helps us estimate the classes $y_i$ as best as we can.
You can think of $h_\theta(x_i)$ as the probability that $y_i=1$, given $x_i$ and $\theta$.
$$P(y_i=1|x_i,\theta)=h_\theta(x_i)$$
Likewise, $1-h_\theta(x_i)$ is the probability that $y_i=0$, given $x_i$ and $\theta$.
$$P(y_i=0|x_i,\theta)=1-h_\theta(x_i)$$
We can combine the two formulas in a clever way using exponents:
$$P(y_i|x_i,\theta)=h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}$$
(Note that one of the terms is always reduced to 1 because one of the exponents is always zero.)
Since all instances are independent, the total probability over all instances $i$ is just the product of all the individual probabilities:
$$P(\textbf{y}|\textbf{x},\theta)=\prod_i h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}$$
We are hoping to maximize the probability of the output vector $\textbf{y}$, or equivalently, maximize its $\textbf{log}$.
$$\log\big( P(\textbf{y}|\textbf{x},\theta)\big)=\log\big(\prod_i h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}\big)$$
$$=\sum_i \log\big(h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}\big)$$
$$=\sum_i \big[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\big]$$
This is a function with a maximum, but since we want to use gradient descent, we can just throw a negative sign in front to turn the maximum into a minimum, and scale it by the number of instances $m$ for convenience. (This makes the error more or less invariant to the number of instances.)A negative sign is added to the front since we want to reduce the cost function. The log loss function is strictly increasing, but by adding a negative sign we invert it.
$$J(\theta) = -\frac{1}{m}\sum_i^m \big[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\big]$$
Why did we bother taking the $\log$? Because it's easier to take the derivative of a sum rather than the derivative of a product (imagine all that product rule!). You'll find this trick is used a lot in machine learning to make differentiation easier.
You also asked why we chose $h_\theta(x_i)$ to be the sigmoid function. A couple reasons are:
To me it seems like the use of sigmoid is more of an engineered solution rather than something we arrived at from a mathematical proof. It has nice properties and seems to work.
In neural networks, some people prefer using alternatives to the sigmoid like $arctan$ and $tanh$, but I don't think it makes that much of a difference in most cases.
Here I will prove the below loss function is a convex function. \begin{equation} L(\theta, \theta_0) = \sum_{i=1}^N \left( - y^i \log(\sigma(\theta^T x^i + \theta_0)) - (1-y^i) \log(1-\sigma(\theta^T x^i + \theta_0)) \right) \end{equation}
Then will show that the loss function below that the questioner proposed is NOT a convex function.
\begin{equation} L(\theta, \theta_0) = \sum_{i=1}^N \left( y^i (1-\sigma(\theta^T x^i + \theta_0))^2 + (1-y^i) \sigma(\theta^T x^i + \theta_0)^2 \right) \end{equation}
To prove that solving a logistic regression using the first loss function is solving a convex optimization problem, we need two facts (to prove).
$ \newcommand{\reals}{{\mathbf{R}}} \newcommand{\preals}{{\reals_+}} \newcommand{\ppreals}{{\reals_{++}}} $
Suppose that $\sigma: \reals \to \ppreals$ is the sigmoid function defined by
\begin{equation} \sigma(z) = 1/(1+\exp(-z)) \end{equation}
The functions $f_1:\reals\to\reals$ and $f_2:\reals\to\reals$ defined by $f_1(z) = -\log(\sigma(z))$ and $f_2(z) = -\log(1-\sigma(z))$ respectively are convex functions.
A (twice-differentiable) convex function of an affine function is a convex function.
Proof) First, we show that $f_1$ and $f_2$ are convex functions. Since \begin{eqnarray} f_1(z) = -\log(1/(1+\exp(-z))) = \log(1+\exp(-z)), \end{eqnarray} \begin{eqnarray} \frac{d}{dz} f_1(z) = -\exp(-z)/(1+\exp(-z)) = -1 + 1/(1+exp(-z)) = -1 + \sigma(z), \end{eqnarray} the derivative of $f_1$ is a monotonically increasing function and $f_1$ function is a (strictly) convex function (Wiki page for convex function).
Likewise, since \begin{eqnarray} f_2(z) = -\log(\exp(-z)/(1+\exp(-z))) = \log(1+\exp(-z)) +z = f_1(z) + z \end{eqnarray} \begin{eqnarray} \frac{d}{dz} f_2(z) = \frac{d}{dz} f_1(z) + 1. \end{eqnarray} Since the derivative of $f_1$ is a monotonically increasing function, that of $f_2$ is also a monotonically increasing function, hence $f_2$ is a (strictly) convex function, hence the proof.
Now we prove the second claim. Let $f:\reals^m\to\reals$ is a twice-differential convex function, $A\in\reals^{m\times n}$, and $b\in\reals^m$. And let $g:\reals^n\to\reals$ such that $g(y) = f(Ay + b)$. Then the gradient of $g$ with respect to $y$ is \begin{equation} \nabla_y g(y) = A^T \nabla_x f(Ay+b) \in \reals^n, \end{equation} and the Hessian of $g$ with respect to $y$ is \begin{equation} \nabla_y^2 g(y) = A^T \nabla_x^2 f(Ay+b) A \in \reals^{n \times n}. \end{equation} Since $f$ is a convex function, $\nabla_x^2 f(x) \succeq 0$, i.e., it is a positive semidefinite matrix for all $x\in\reals^m$. Then for any $z\in\reals^n$, \begin{equation} z^T \nabla_y^2 g(y) z = z^T A^T \nabla_x^2 f(Ay+b) A z = (Az)^T \nabla_x^2 f(Ay+b) (A z) \geq 0, \end{equation} hence $\nabla_y^2 g(y)$ is also a positive semidefinite matrix for all $y\in\reals^n$ (Wiki page for convex function).
Now the object function to be minimized for logistic regression is \begin{equation} \begin{array}{ll} \mbox{minimize} & L(\theta) = \sum_{i=1}^N \left( - y^i \log(\sigma(\theta^T x^i + \theta_0)) - (1-y^i) \log(1-\sigma(\theta^T x^i + \theta_0)) \right) \end{array} \end{equation} where $(x^i, y^i)$ for $i=1,\ldots, N$ are $N$ training data. Now this is the sum of convex functions of linear (hence, affine) functions in $(\theta, \theta_0)$. Since the sum of convex functions is a convex function, this problem is a convex optimization.
Note that if it maximized the loss function, it would NOT be a convex optimization function. So the direction is critical!
Note also that, whether the algorithm we use is stochastic gradient descent, just gradient descent, or any other optimization algorithm, it solves the convex optimization problem, and that even if we use nonconvex nonlinear kernels for feature transformation, it is still a convex optimization problem since the loss function is still a convex function in $(\theta, \theta_0)$.
Now the new loss function proposed by the questioner is \begin{equation} L(\theta, \theta_0) = \sum_{i=1}^N \left( y^i (1-\sigma(\theta^T x^i + \theta_0))^2 + (1-y^i) \sigma(\theta^T x^i + \theta_0)^2 \right) \end{equation}
First we show that $f(z) = \sigma(z)^2$ is not a convex function in $z$. If we differentiate this function, we have \begin{equation} f'(z) = \frac{d}{dz} \sigma(z)^2 = 2 \sigma(z) \frac{d}{dz} \sigma(z) = 2 \exp(-z) / (1+\exp(-z))^3. \end{equation} Since $f'(0)=1$ and $\lim_{z\to\infty} f'(z) = 0$ (and f'(z) is differentiable), the mean value theorem implies that there exists $z_0\geq0$ such that $f'(z_0) < 0$. Therefore $f(z)$ is NOT a convex function.
Now if we let $N=1$, $x^1 = 1$, $y^1 = 0$, $\theta_0=0$, and $\theta\in\reals$, $L(\theta, 0) = \sigma(\theta)^2$, hence $L(\theta,0)$ is not a convex function, hence the proof!
However, solving the non-convex optimization problem using gradient descent is not necessarily bad idea. (Almost) all deep learning problem is solved by stochastic gradient descent because it's the only way to solve it (other than evolutionary algorithms).
I hope this is a self-contained (strict) proof for the argument. Please leave feedback if anything is unclear or I made mistakes.
Thank you. - Sunghee
Best Answer
For logistic regression, $h_\theta(x)=\frac{1}{1 + e^{-\theta^{T}x}}$ is the sigmoidal function. So, $-\log(1 - h_\theta(x))$ = $-\log{\frac{1}{1 + e^{\theta^{T}{x}}}}$. So, if $h_\theta(x)$ tends to $0$ then $e^{\theta^{T}x} = 0$, So, the cost should be $0$ should not be $-\infty$, which makes sense since the cost should be minimum for correct prediction where $h_\theta(x)$ matches the actual $y$ value.