We are given a data vector $\textbf{x}$ and a class vector $\textbf{y}$. The class vector tells us which of two classes $\{0,1\}$ the data instances belong to.
We want to come up with a function $h_\theta(x_i)$ that helps us estimate the classes $y_i$ as best as we can.
You can think of $h_\theta(x_i)$ as the probability that $y_i=1$, given $x_i$ and $\theta$.
$$P(y_i=1|x_i,\theta)=h_\theta(x_i)$$
Likewise, $1-h_\theta(x_i)$ is the probability that $y_i=0$, given $x_i$ and $\theta$.
$$P(y_i=0|x_i,\theta)=1-h_\theta(x_i)$$
We can combine the two formulas in a clever way using exponents:
$$P(y_i|x_i,\theta)=h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}$$
(Note that one of the terms is always reduced to 1 because one of the exponents is always zero.)
Since all instances are independent, the total probability over all instances $i$ is just the product of all the individual probabilities:
$$P(\textbf{y}|\textbf{x},\theta)=\prod_i h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}$$
We are hoping to maximize the probability of the output vector $\textbf{y}$, or equivalently, maximize its $\textbf{log}$.
$$\log\big( P(\textbf{y}|\textbf{x},\theta)\big)=\log\big(\prod_i h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}\big)$$
$$=\sum_i \log\big(h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}\big)$$
$$=\sum_i \big[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\big]$$
This is a function with a maximum, but since we want to use gradient descent, we can just put a negative sign in front to turn the maximum into a minimum, and scale it by the number of instances $m$ for convenience. (This makes the error more or less invariant to the number of instances.) In other words, the negative sign turns maximizing the log-likelihood into minimizing a cost function.
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \big[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\big]$$
Why did we bother taking the $\log$? Because it's easier to take the derivative of a sum rather than the derivative of a product (imagine all that product rule!). You'll find this trick is used a lot in machine learning to make differentiation easier.
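To make the cost function concrete, here is a minimal NumPy sketch of $J(\theta)$; the function and variable names are my own, not from the question:

```python
import numpy as np

def sigmoid(z):
    # h_theta(x) = sigmoid(theta^T x)
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ]
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```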
You also asked why we chose $h_\theta(x_i)$ to be the sigmoid function. A couple reasons are:
- It is differentiable (unlike the unit step function), so we can use gradient descent with it.
- Its domain is the whole real line $\mathbb{R}$ and its range is $(0,1)$, which seems like a good idea for a binary classification problem with no constraints on the input value.
To me it seems like the use of sigmoid is more of an engineered solution rather than something we arrived at from a mathematical proof. It has nice properties and seems to work.
In neural networks, some people prefer using alternatives to the sigmoid like $\arctan$ and $\tanh$, but I don't think it makes that much of a difference in most cases.
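As a quick numerical check of the two properties listed above (differentiability and the bounded range), here is a small sketch of my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 1001)
s = sigmoid(z)

# The output stays strictly inside (0, 1) for any real input.
assert np.all((s > 0) & (s < 1))

# The numerical derivative matches the closed form sigma(z) * (1 - sigma(z)).
assert np.allclose(np.gradient(s, z), s * (1 - s), atol=1e-3)
```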
Here I will prove that the loss function below is a convex function.
\begin{equation}
L(\theta, \theta_0) = \sum_{i=1}^N \left( - y^i \log(\sigma(\theta^T x^i + \theta_0))
- (1-y^i) \log(1-\sigma(\theta^T x^i + \theta_0))
\right)
\end{equation}
Then I will show that the loss function below, which the questioner proposed, is NOT a convex function.
\begin{equation}
L(\theta, \theta_0) = \sum_{i=1}^N \left( y^i (1-\sigma(\theta^T x^i + \theta_0))^2
+ (1-y^i) \sigma(\theta^T x^i + \theta_0)^2
\right)
\end{equation}
To prove that solving a logistic regression using the first loss function is solving a convex optimization problem, we need to prove two facts.
$
\newcommand{\reals}{{\mathbf{R}}}
\newcommand{\preals}{{\reals_+}}
\newcommand{\ppreals}{{\reals_{++}}}
$
Suppose that $\sigma: \reals \to \ppreals$ is the sigmoid function defined by
\begin{equation}
\sigma(z) = 1/(1+\exp(-z))
\end{equation}
The functions $f_1:\reals\to\reals$ and $f_2:\reals\to\reals$ defined by $f_1(z) = -\log(\sigma(z))$ and $f_2(z) = -\log(1-\sigma(z))$ respectively are convex functions.
A (twice-differentiable) convex function of an affine function is a convex function.
Proof) First, we show that $f_1$ and $f_2$ are convex functions. Since
\begin{eqnarray}
f_1(z) = -\log(1/(1+\exp(-z))) = \log(1+\exp(-z)),
\end{eqnarray}
\begin{eqnarray}
\frac{d}{dz} f_1(z) = -\exp(-z)/(1+\exp(-z)) = -1 + 1/(1+\exp(-z)) = -1 + \sigma(z),
\end{eqnarray}
the derivative of $f_1$ is a monotonically increasing function (because $\sigma$ itself is monotonically increasing), hence $f_1$ is a (strictly) convex function (see the Wikipedia page for convex functions).
Likewise, since
\begin{eqnarray}
f_2(z) = -\log(\exp(-z)/(1+\exp(-z))) = \log(1+\exp(-z)) +z = f_1(z) + z
\end{eqnarray}
\begin{eqnarray}
\frac{d}{dz} f_2(z) = \frac{d}{dz} f_1(z) + 1.
\end{eqnarray}
Since the derivative of $f_1$ is a monotonically increasing function, that of $f_2$ is also a monotonically increasing function, hence $f_2$ is a (strictly) convex function, which completes the proof of the first fact.
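As an informal sanity check (not part of the proof), the convexity of $f_1$ and $f_2$ can also be verified numerically through their second differences; the code below is my own sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f1 = lambda z: -np.log(sigmoid(z))        # f1(z) = -log(sigma(z))
f2 = lambda z: -np.log(1.0 - sigmoid(z))  # f2(z) = -log(1 - sigma(z))

z = np.linspace(-8, 8, 2001)
for f in (f1, f2):
    # Nonnegative second differences are consistent with convexity.
    assert np.all(np.diff(f(z), n=2) >= -1e-12)
```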
Now we prove the second claim. Let $f:\reals^m\to\reals$ be a twice-differentiable convex function, $A\in\reals^{m\times n}$, and $b\in\reals^m$, and let $g:\reals^n\to\reals$ be defined by $g(y) = f(Ay + b)$. Then the gradient of $g$ with respect to $y$ is
\begin{equation}
\nabla_y g(y) = A^T \nabla_x f(Ay+b) \in \reals^n,
\end{equation}
and the Hessian of $g$ with respect to $y$ is
\begin{equation}
\nabla_y^2 g(y) = A^T \nabla_x^2 f(Ay+b) A \in \reals^{n \times n}.
\end{equation}
Since $f$ is a convex function, $\nabla_x^2 f(x) \succeq 0$, i.e., it is a positive semidefinite matrix for all $x\in\reals^m$. Then for any $z\in\reals^n$,
\begin{equation}
z^T \nabla_y^2 g(y) z = z^T A^T \nabla_x^2 f(Ay+b) A z
= (Az)^T \nabla_x^2 f(Ay+b) (A z) \geq 0,
\end{equation}
hence $\nabla_y^2 g(y)$ is also a positive semidefinite matrix for all $y\in\reals^n$ (Wiki page for convex function).
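Here is a small numerical illustration of this fact, with a random $A$ and a random positive semidefinite matrix standing in for the Hessian (dimensions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3

B = rng.standard_normal((m, m))
H = B @ B.T                      # B B^T is always positive semidefinite
A = rng.standard_normal((m, n))

G = A.T @ H @ A                  # plays the role of the Hessian of g(y) = f(Ay + b)

# All eigenvalues of A^T H A should be nonnegative (up to round-off).
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)
```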
Now the objective function to be minimized for logistic regression is
\begin{equation}
\begin{array}{ll}
\mbox{minimize} &
L(\theta, \theta_0) = \sum_{i=1}^N \left( - y^i \log(\sigma(\theta^T x^i + \theta_0))
- (1-y^i) \log(1-\sigma(\theta^T x^i + \theta_0))
\right)
\end{array}
\end{equation}
where $(x^i, y^i)$ for $i=1,\ldots, N$ are $N$ training data. Now this is the sum of convex functions of linear (hence, affine) functions in $(\theta, \theta_0)$. Since the sum of convex functions is a convex function, this problem is a convex optimization.
Note that if we maximized the loss function instead, it would NOT be a convex optimization problem. So the direction is critical!
Note also that whether we use stochastic gradient descent, plain gradient descent, or any other optimization algorithm, we are solving the same convex optimization problem, and that even if we use nonconvex nonlinear kernels for feature transformation, the problem remains convex since the loss function is still a convex function in $(\theta, \theta_0)$.
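For concreteness, here is a minimal gradient descent sketch for this objective; the toy data, step size, and iteration count are made up, and the loss is averaged over $N$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=5000):
    # X: (N, d) data matrix, y: (N,) labels in {0, 1}.
    # A column of ones is appended so theta_0 is learned as the last weight.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = sigmoid(Xb @ theta)
        grad = Xb.T @ (p - y) / len(y)   # gradient of the averaged log loss
        theta -= lr * grad
    return theta

# Toy usage: two well-separated clusters in 1D.
X = np.array([[-2.0], [-1.5], [1.5], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)   # theta[:-1] are the weights, theta[-1] is theta_0
```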
Now the new loss function proposed by the questioner is
\begin{equation}
L(\theta, \theta_0) = \sum_{i=1}^N \left( y^i (1-\sigma(\theta^T x^i + \theta_0))^2
+ (1-y^i) \sigma(\theta^T x^i + \theta_0)^2
\right)
\end{equation}
First we show that $f(z) = \sigma(z)^2$ is not a convex function in $z$. If we differentiate this function, we have
\begin{equation}
f'(z) = \frac{d}{dz} \sigma(z)^2 = 2 \sigma(z) \frac{d}{dz} \sigma(z)
= 2 \exp(-z) / (1+\exp(-z))^3.
\end{equation}
Since $f'(0)=1/4>0$, $\lim_{z\to\infty} f'(z) = 0$, and $f'(z)$ is differentiable, the mean value theorem implies that there exists $z_0>0$ such that $f''(z_0) < 0$, so $f'$ is not monotonically nondecreasing. Therefore $f(z)$ is NOT a convex function.
Now if we let $N=1$, $x^1 = 1$, $y^1 = 0$, $\theta_0=0$, and $\theta\in\reals$, then $L(\theta, 0) = \sigma(\theta)^2$, hence $L(\theta,0)$ is not a convex function, which completes the proof!
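The non-convexity can also be seen numerically: the second differences of $\sigma(z)^2$ change sign (again, just an illustration of my own, not part of the proof):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 2001)
second_diff = np.diff(sigmoid(z) ** 2, n=2)

# The discrete second derivative is positive in some places and negative in
# others, so sigma(z)^2 cannot be convex.
assert second_diff.min() < 0 < second_diff.max()
```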
However, solving a non-convex optimization problem using gradient descent is not necessarily a bad idea. (Almost) all deep learning problems are solved by stochastic gradient descent because it is practically the only way to solve them (other than evolutionary algorithms).
I hope this is a self-contained (strict) proof for the argument. Please leave feedback if anything is unclear or I made mistakes.
Thank you. - Sunghee
Best Answer
My example is probably not the best, but it still shows you that the linear regression cost function applied to logistic regression can give local minima.
Let us consider a training set with 3 examples in 1D: $$\{(-1,2), (-20,-1), (-5,5)\}$$
For the sake of simplicity let us consider a model without constant term so that we get the activation function $$h_{\theta}(x)=\dfrac{1}{1+e^{\theta \cdot x}}$$
If we use the cost function of linear regression, we get
$$J(\theta)=\dfrac{1}{6}\left(\left(\dfrac{1}{1+e^{-\theta}}-2\right)^2 + \left(\dfrac{1}{1+e^{-20\theta}}+1\right)^2+\left(\dfrac{1}{1+e^{-5\theta}}-5\right)^2\right)$$
Let us now plot this function. You need to zoom in a bit around $\theta=0$ to see the local minimum.
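Here is a small matplotlib sketch that reproduces this curve; the plotting window around $0$ is my own choice:

```python
import numpy as np
import matplotlib.pyplot as plt

def h(theta, x):
    # The model from above, without a constant term.
    return 1.0 / (1.0 + np.exp(theta * x))

# Training set {(-1, 2), (-20, -1), (-5, 5)}.
data = [(-1.0, 2.0), (-20.0, -1.0), (-5.0, 5.0)]

theta = np.linspace(-0.5, 2.0, 2000)
J = sum((h(theta, x) - y) ** 2 for x, y in data) / 6.0

plt.plot(theta, J)
plt.xlabel(r"$\theta$")
plt.ylabel(r"$J(\theta)$")
plt.show()
```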
There is clearly a local minimum that we could get stuck in. This is one of the reasons you should use the classical cost function for logistic regression.
Besides, I am sure you could find a graph with even more local minima by using a larger number of training examples $n$.