The reason is the following. We use the notation:
$$\theta x^i:=\theta_0+\theta_1 x^i_1+\dots+\theta_p x^i_p.$$
Then
$$\log h_\theta(x^i)=\log\frac{1}{1+e^{-\theta x^i}}=-\log ( 1+e^{-\theta x^i} ),$$ $$\log(1- h_\theta(x^i))=\log\left(1-\frac{1}{1+e^{-\theta x^i}}\right)=\log (e^{-\theta x^i} )-\log ( 1+e^{-\theta x^i} )=-\theta x^i-\log ( 1+e^{-\theta x^i} ).$$ [Here we wrote $1 = \frac{1+e^{-\theta x^i}}{1+e^{-\theta x^i}}$ so that the $1$'s in the numerator cancel, and then used $\log(x/y) = \log(x) - \log(y)$.]
Since our original cost function is of the form
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right],$$
Plugging in the two simplified expressions above, we obtain
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[-y^i(\log ( 1+e^{-\theta x^i})) + (1-y^i)(-\theta x^i-\log ( 1+e^{-\theta x^i} ))\right]$$, which can be simplified to:
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\theta x^i-\log(1+e^{-\theta x^i})\right]=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\log(1+e^{\theta x^i})\right],~~(*)$$
where the second equality follows from
$$-\theta x^i-\log(1+e^{-\theta x^i})=
-\left[ \log e^{\theta x^i}+
\log(1+e^{-\theta x^i} )
\right]=-\log(1+e^{\theta x^i}). $$ [Here we used $\theta x^i=\log e^{\theta x^i}$ and $\log(x) + \log(y) = \log(x y)$.]
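Before taking derivatives, here is a minimal numerical sanity check of the simplification $(*)$; the toy data, the random seed, and the use of NumPy are my own illustration, not part of the derivation:

```python
import numpy as np

# Check that the original cross-entropy form of J(theta) and the simplified
# form (*) agree on random data (illustrative assumption: Gaussian features,
# an intercept column prepended to X so that theta_0 is handled uniformly).
rng = np.random.default_rng(0)
m, p = 50, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, p))])
y = rng.integers(0, 2, size=m)
theta = rng.normal(size=p + 1)

z = X @ theta                    # theta x^i for every i
h = 1.0 / (1.0 + np.exp(-z))     # h_theta(x^i)

J_original = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
J_star = -np.mean(y * z - np.log1p(np.exp(z)))   # form (*)

print(np.allclose(J_original, J_star))  # True
```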
All you need now is to compute the partial derivatives of $(*)$ w.r.t. $\theta_j$. As
$$\frac{\partial}{\partial \theta_j}y^i\theta x^i=y^ix^i_j, $$
$$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}=x^i_jh_\theta(x^i),$$
the thesis follows, namely $$\frac{\partial}{\partial \theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^i)-y^i\right)x^i_j.$$
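And here is a hedged finite-difference check of that gradient formula (again, the data and the helper function `J` are assumptions made purely for illustration):

```python
import numpy as np

# Compare the analytic gradient (1/m) * sum_i (h_theta(x^i) - y^i) x^i_j
# against central finite differences of J(theta) in form (*).
rng = np.random.default_rng(1)
m, p = 40, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, p))])
y = rng.integers(0, 2, size=m)
theta = rng.normal(size=p + 1)

def J(t):
    z = X @ t
    return -np.mean(y * z - np.log1p(np.exp(z)))

analytic = X.T @ (1.0 / (1.0 + np.exp(-(X @ theta))) - y) / m

eps = 1e-6
numeric = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                    for e in np.eye(p + 1)])

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```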
Here I will prove that the loss function below is convex.
\begin{equation}
L(\theta, \theta_0) = \sum_{i=1}^N \left( - y^i \log(\sigma(\theta^T x^i + \theta_0))
- (1-y^i) \log(1-\sigma(\theta^T x^i + \theta_0))
\right)
\end{equation}
Then I will show that the loss function below, which the questioner proposed, is NOT convex.
\begin{equation}
L(\theta, \theta_0) = \sum_{i=1}^N \left( y^i (1-\sigma(\theta^T x^i + \theta_0))^2
+ (1-y^i) \sigma(\theta^T x^i + \theta_0)^2
\right)
\end{equation}
To prove that solving a logistic regression using the first loss function is solving a convex optimization problem, we need to prove two facts.
$
\newcommand{\reals}{{\mathbf{R}}}
\newcommand{\preals}{{\reals_+}}
\newcommand{\ppreals}{{\reals_{++}}}
$
Suppose that $\sigma: \reals \to \ppreals$ is the sigmoid function defined by
\begin{equation}
\sigma(z) = 1/(1+\exp(-z))
\end{equation}
The first fact is that the functions $f_1:\reals\to\reals$ and $f_2:\reals\to\reals$ defined by $f_1(z) = -\log(\sigma(z))$ and $f_2(z) = -\log(1-\sigma(z))$, respectively, are convex functions.
The second fact is that a (twice-differentiable) convex function of an affine function is a convex function.
Proof) First, we show that $f_1$ and $f_2$ are convex functions. Since
\begin{eqnarray}
f_1(z) = -\log(1/(1+\exp(-z))) = \log(1+\exp(-z)),
\end{eqnarray}
\begin{eqnarray}
\frac{d}{dz} f_1(z) = -\exp(-z)/(1+\exp(-z)) = -1 + 1/(1+\exp(-z)) = -1 + \sigma(z),
\end{eqnarray}
the derivative of $f_1$ is a monotonically increasing function (because $\sigma$ is), hence $f_1$ is a (strictly) convex function (see the Wikipedia page on convex functions).
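As an extra step (not in the original argument), one can also verify convexity of $f_1$ directly from its second derivative, using the standard identity $\frac{d}{dz}\sigma(z) = \sigma(z)(1-\sigma(z))$:
\begin{equation}
\frac{d^2}{dz^2} f_1(z) = \frac{d}{dz}\left(-1 + \sigma(z)\right) = \sigma(z)\left(1-\sigma(z)\right) > 0 \quad \mbox{for all } z \in \reals.
\end{equation}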
Likewise, since
\begin{eqnarray}
f_2(z) = -\log(\exp(-z)/(1+\exp(-z))) = \log(1+\exp(-z)) +z = f_1(z) + z
\end{eqnarray}
\begin{eqnarray}
\frac{d}{dz} f_2(z) = \frac{d}{dz} f_1(z) + 1.
\end{eqnarray}
Since the derivative of $f_1$ is a monotonically increasing function, that of $f_2$ is also a monotonically increasing function, hence $f_2$ is also a (strictly) convex function. This proves the first fact.
Now we prove the second fact. Let $f:\reals^m\to\reals$ be a twice-differentiable convex function, $A\in\reals^{m\times n}$, and $b\in\reals^m$, and let $g:\reals^n\to\reals$ be defined by $g(y) = f(Ay + b)$. Then the gradient of $g$ with respect to $y$ is
\begin{equation}
\nabla_y g(y) = A^T \nabla_x f(Ay+b) \in \reals^n,
\end{equation}
and the Hessian of $g$ with respect to $y$ is
\begin{equation}
\nabla_y^2 g(y) = A^T \nabla_x^2 f(Ay+b) A \in \reals^{n \times n}.
\end{equation}
Since $f$ is a convex function, $\nabla_x^2 f(x) \succeq 0$, i.e., it is a positive semidefinite matrix for all $x\in\reals^m$. Then for any $z\in\reals^n$,
\begin{equation}
z^T \nabla_y^2 g(y) z = z^T A^T \nabla_x^2 f(Ay+b) A z
= (Az)^T \nabla_x^2 f(Ay+b) (A z) \geq 0,
\end{equation}
hence $\nabla_y^2 g(y)$ is also a positive semidefinite matrix for all $y\in\reals^n$, so $g$ is convex (see the Wikipedia page on convex functions).
Now the objective function to be minimized for logistic regression is
\begin{equation}
\begin{array}{ll}
\mbox{minimize} &
L(\theta) = \sum_{i=1}^N \left( - y^i \log(\sigma(\theta^T x^i + \theta_0))
- (1-y^i) \log(1-\sigma(\theta^T x^i + \theta_0))
\right)
\end{array}
\end{equation}
where $(x^i, y^i)$ for $i=1,\ldots, N$ are the $N$ training data points. Now this is the sum of convex functions of linear (hence, affine) functions in $(\theta, \theta_0)$. Since the sum of convex functions is a convex function, this problem is a convex optimization problem.
Note that if we maximized this loss function instead, it would NOT be a convex optimization problem. So the direction is critical!
Note also that, whether the algorithm we use is stochastic gradient descent, just gradient descent, or any other optimization algorithm, it solves the convex optimization problem, and that even if we use nonconvex nonlinear kernels for feature transformation, it is still a convex optimization problem since the loss function is still a convex function in $(\theta, \theta_0)$.
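To make the conclusion tangible, here is a small numerical illustration; it relies on the standard closed form of the Hessian of this loss, $\sum_i \sigma_i(1-\sigma_i)\,\tilde{x}_i\tilde{x}_i^T$ with $\tilde{x}_i = (x^i, 1)$, which is not derived above, and the random data are assumptions of mine:

```python
import numpy as np

# At random points (theta, theta_0), the Hessian of the logistic loss L is
# Xt^T diag(s * (1 - s)) Xt with Xt = [X, 1], which should be positive
# semidefinite if the problem is convex.
rng = np.random.default_rng(2)
N, n = 30, 4
X = rng.normal(size=(N, n))
Xt = np.hstack([X, np.ones((N, 1))])          # augment with the intercept column

for _ in range(5):
    w = rng.normal(size=n + 1)                # a random (theta, theta_0)
    s = 1.0 / (1.0 + np.exp(-(Xt @ w)))       # sigma(theta^T x^i + theta_0)
    H = Xt.T @ (Xt * (s * (1 - s))[:, None])  # Hessian of L at this point
    print(np.linalg.eigvalsh(H).min() >= -1e-10)   # True (up to roundoff)
```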
Now the new loss function proposed by the questioner is
\begin{equation}
L(\theta, \theta_0) = \sum_{i=1}^N \left( y^i (1-\sigma(\theta^T x^i + \theta_0))^2
+ (1-y^i) \sigma(\theta^T x^i + \theta_0)^2
\right)
\end{equation}
First we show that $f(z) = \sigma(z)^2$ is not a convex function in $z$. If we differentiate this function, we have
\begin{equation}
f'(z) = \frac{d}{dz} \sigma(z)^2 = 2 \sigma(z) \frac{d}{dz} \sigma(z)
= 2 \exp(-z) / (1+\exp(-z))^3.
\end{equation}
Since $f'(0)=1/4>0$, $\lim_{z\to\infty} f'(z) = 0$, and $f'$ is differentiable, the mean value theorem implies that there exists $z_0>0$ such that $f''(z_0) < 0$. Therefore $f(z)$ is NOT a convex function.
Now if we let $N=1$, $x^1 = 1$, $y^1 = 0$, $\theta_0=0$, and $\theta\in\reals$, then $L(\theta, 0) = \sigma(\theta)^2$, which we have just shown is not a convex function. This completes the proof!
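For readers who prefer a concrete counterexample to the derivative argument, here is a small check of the midpoint convexity condition for $f(z)=\sigma(z)^2$ (the particular points $a=0$ and $b=4$ are my own choice):

```python
import numpy as np

# For a convex f we must have f((a+b)/2) <= (f(a) + f(b)) / 2.
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
f = lambda z: sigma(z) ** 2

a, b = 0.0, 4.0
print(f((a + b) / 2))        # ~0.776
print((f(a) + f(b)) / 2)     # ~0.607 -> the midpoint lies ABOVE the chord, so f is not convex
```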
However, solving a non-convex optimization problem using gradient descent is not necessarily a bad idea. (Almost) all deep learning problems are solved by stochastic gradient descent, because in practice it is essentially the only scalable way to solve them (other than evolutionary algorithms).
I hope this is a self-contained (strict) proof of the argument. Please leave feedback if anything is unclear or if I have made a mistake.
Thank you. - Sunghee
Best Answer
Let's try to derive why the logarithm appears in the cost function of logistic regression from first principles.
So we have a dataset $\mathbf{X}$ consisting of $m$ data points and $n$ features, and a class variable $\mathbf{y}$, a vector of length $m$ whose entries take one of two values, 1 or 0.
Now logistic regression says that the probability that the class variable equals 1, i.e. $y_i = 1$ for $i=1,2,\dots,m$, can be modelled as follows:
$$ P( y_i =1 | \mathbf{x}_i ; \theta) = h_{\theta}(\mathbf{x}_i) = \dfrac{1}{1+e^{(- \theta^T \mathbf{x}_i)}} $$
so $y_i = 1$ with probability $h_{\theta}(\mathbf{x}_i)$ and $y_i=0$ with probability $1-h_{\theta}(\mathbf{x}_i)$.
This can be combined into a single equation as follows (in fact, $y_i$ follows a Bernoulli distribution):
$$ P(y_i ) = h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$
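As a quick sanity check of this combined expression (a step I am adding for clarity), substituting the two possible values of $y_i$ recovers the two cases above:
$$ y_i = 1:\quad h_{\theta}(\mathbf{x}_i)^{1} (1 - h_{\theta}(\mathbf{x}_i))^{0} = h_{\theta}(\mathbf{x}_i),
\qquad
y_i = 0:\quad h_{\theta}(\mathbf{x}_i)^{0} (1 - h_{\theta}(\mathbf{x}_i))^{1} = 1 - h_{\theta}(\mathbf{x}_i). $$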
$P(y_i)$ is known as the likelihood of a single data point $\mathbf{x}_i$, i.e., given the value of $y_i$, the probability of $\mathbf{x}_i$ occurring; it is the conditional probability $P(\mathbf{x}_i | y_i)$.
The likelihood of the entire dataset $\mathbf{X}$ is the product of the individual data point likelihoods. Thus
$$ P(\mathbf{X}|\mathbf{y}) = \prod_{i=1}^{m} P(\mathbf{x}_i | y_i) = \prod_{i=1}^{m} h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$
Now the principle of maximum likelihood says that we find the parameters that maximise the likelihood $P(\mathbf{X}|\mathbf{y})$.
As mentioned in the comment, logarithms are used because they convert products into sums and do not alter the maximization search, as they are monotone increasing functions. Here too we have a product form in the likelihood. So we take the natural logarithm, as maximising the likelihood is the same as maximising the log likelihood; the log likelihood $L(\theta)$ is now:
$$ L(\theta) = \log(P(\mathbf{X}|\mathbf{y})) = \sum_{i=1}^{m} \left[ y_i \log(h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log(1 - h_{\theta}(\mathbf{x}_i)) \right]. $$
Since in linear regression we found the $\theta$ that minimizes our cost function, here too, for the sake of consistency, we would like to have a minimization problem. And we want the average cost over all the data points. Currently, we have a maximization of $L(\theta)$. Maximization of $L(\theta)$ is equivalent to minimization of $-L(\theta)$. And using the average cost over all data points, our cost function for logistic regression comes out to be
$$ J(\theta) = - \dfrac{1}{m} L(\theta)$$
$$ = - \dfrac{1}{m} \left( \sum_{i=1}^{m} y_i \log (h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right )$$
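To make the relationship between the likelihood, the log likelihood, and the cost concrete, here is a small numerical sketch (the random data, the seed, and the variable names are assumptions of mine, not part of the answer):

```python
import numpy as np

# The product-form likelihood, the log likelihood L(theta), and the cost
# J(theta) = -(1/m) L(theta) are related exactly as derived above.
rng = np.random.default_rng(3)
m, n = 20, 3
X = rng.normal(size=(m, n))
y = rng.integers(0, 2, size=m)
theta = rng.normal(size=n)

h = 1.0 / (1.0 + np.exp(-(X @ theta)))       # h_theta(x_i) for every i

likelihood = np.prod(h ** y * (1 - h) ** (1 - y))
log_likelihood = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
J = -log_likelihood / m

print(np.allclose(np.log(likelihood), log_likelihood))  # True
print(J >= 0)                                            # True: the cost is nonnegative
```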
Now we can also see where the cost for a single data point comes from:
the cost for a single data point is $-\log( P(\mathbf{x}_i | y_i))$, which can be written as $- \left( y_i \log (h_{\theta}(\mathbf{x}_i)) + (1 - y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right)$.
We can now split the above into two cases depending on the value of $y_i$. Thus we get
$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=1$, and
$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (1 - h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=0$.
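As a worked example (the numbers are my own, chosen for illustration): if $y_i = 1$ but the model is confidently wrong, say $h_{\theta}(\mathbf{x}_i) = 0.01$, the cost is $-\log(0.01) \approx 4.6$, whereas a confident correct prediction $h_{\theta}(\mathbf{x}_i) = 0.99$ costs only $-\log(0.99) \approx 0.01$. So the log loss penalizes confident mistakes very heavily while barely penalizing confident correct answers.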