I will make some changes in notation to clarify. For the brief answer, go to the end of this answer (Summarizing).
We need to find the parameters $w_i$, $i=0,1,\dotsc$, of a linear function (known as the hypothesis function).
$$h_w(x)=w_0+w_1x$$
where $x$ is your input training value $V_{train}(b)$ and $h_w(x)$ is the output of your learned function $V'(b)$. Note that I am using only $w_0,w_1$ since your training set has the form $(x,y)$; for linear regression with $n$ features you need a hypothesis function like
$$h_w(x)=w_0+w_1x_1+w_2x_2+\dotsm+w_nx_n.$$
Also, $w_0$ is your bias or intercept term (the axis intersection), i.e., if the formula of a linear function is $y=a+bx$, then here $w_0$ is equivalent to $a$.
Now you need to find the values of $w_0$ and $w_1$. To save work, you can use matrix-vector operations. So if you have two vectors $\mathbf w, \mathbf x$ such that
$$\mathbf w=\begin{pmatrix}w_0\\w_1\\\vdots\\w_n\end{pmatrix}\quad\text{and}\quad\mathbf x=\begin{pmatrix}x_0\\x_1\\\vdots\\x_n\end{pmatrix},$$
you can obtain their inner product by
$$\mathbf w^T\mathbf x=\begin{pmatrix}w_0\;w_1\;\dotsm\;w_n\end{pmatrix}\begin{pmatrix}x_0\\x_1\\\vdots\\x_n\end{pmatrix}=w_0x_0+w_1x_1+w_2x_2+\dotsm+w_nx_n,$$
where $\mathbf w^T$ is the transpose of $\mathbf w$. Note that we use $x_0=1$ to fix the intercept term $w_0$ (since $w_0x_0=w_01=w_0$).
Thus, to simplify our notation, we can compute the hypothesis for any input by
$$h(\mathbf x)=\sum_{i=0}^n w_ix_i=\mathbf w^T\mathbf x.$$
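As a minimal Python/NumPy sketch of this (the numbers are illustrative, not from the question), the hypothesis is just a dot product:

```python
import numpy as np

# Illustrative parameters and one input, with the intercept entry x_0 = 1.
w = np.array([0.5, 2.0])   # w_0 (intercept), w_1
x = np.array([1.0, 3.0])   # x_0 = 1, x_1 = 3

h = w @ x                  # w^T x = w_0*1 + w_1*3
print(h)                   # 6.5
```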
Finally, suppose our training set has $m$ elements. We need to define the cost function (error function), which maps each choice of the parameters $w$ to an error value in $\mathbf R$ (the real numbers):
$$E(w)=\frac{1}{2}\sum_{i=1}^m(h_w(x^{(i)}) - y^{(i)})^2,$$
where $(x^{(i)},y^{(i)})\in \text{Training set}$ (i.e., $(x^{(i)}, y^{(i)})$ is the $i^{th}$ element of the training set). To avoid confusion: an upper index denotes the $i^{th}$ element of the training set, while a lower index denotes the $i^{th}$ parameter of $h_w$.
LMS algorithm. We consider the gradient descent algorithm, which starts with some initial $w$, and repeatedly performs the update
$$w_j:=w_j-\eta\frac{\partial}{\partial w_j}E(w).$$
(This update is simultaneous, i.e., first compute all the partial derivatives, then update all the $w_j$ at once.)
In order to apply this algorithm to the linear case, we compute, for each $j$,
$$\begin{array}{rcl}\frac{\partial}{\partial w_j}E(w)&=&\frac{\partial}{\partial w_j}\frac{1}{2}\sum_{i=1}^m(h_w(x^{(i)}) - y^{(i)})^2\\&=&\frac{1}{2}\sum_{i=1}^m\frac{\partial}{\partial w_j}(h_w(x^{(i)}) - y^{(i)})^2\\&=&\frac{1}{2}\sum_{i=1}^m2(h_w(x^{(i)}) - y^{(i)})x_j^{(i)}\\&=&\sum_{i=1}^m(h_w(x^{(i)}) - y^{(i)})x_j^{(i)}.\end{array}$$
Thus, you must update each weight by
$$w_j:=w_j+\eta\sum_{i=1}^m(y^{(i)}-h_w(x^{(i)}))x_j^{(i)}.$$
In a spirit similar to $\mathbf w^T\mathbf x$, we can compute the error using matrix-vector operations (I write your case on the right side below). Suppose that your training set is arranged as the matrix
$$X=\begin{pmatrix}-(x^{(1)})^T-\\-(x^{(2)})^T-\\\vdots\\-(x^{(m)})^T-\end{pmatrix}.$$
Also, let $\mathbf y$ be the $m$-dimensional vector of your target values:
$$\mathbf y=\begin{pmatrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{pmatrix}.$$
Now, since $h_w(x^{(i)})=(x^{(i)})^T\mathbf w$, we can verify
$$\begin{array}{rcl}X\mathbf w -\mathbf y&=&\begin{pmatrix}(x^{(1)})^T\mathbf w\\\vdots\\(x^{(m)})^T\mathbf w\end{pmatrix}-\begin{pmatrix}y^{(1)}\\\vdots\\y^{(m)}\end{pmatrix}\\&=&\begin{pmatrix}(x^{(1)})^T\mathbf w-y^{(1)}\\\vdots\\(x^{(m)})^T\mathbf w-y^{(m)}\end{pmatrix}.\end{array}
$$
Thus, using $\mathbf v^T\mathbf v = \sum_i\mathbf v_i^2$ for a vector $\mathbf v$, we obtain
$$\begin{array}{rcl}E(w)&=&\frac{1}{2}\sum_{i=1}^m(h_w(x^{(i)}) - y^{(i)})^2\\&=&\frac{1}{2}(X\mathbf w-\mathbf y)^T(X\mathbf w-\mathbf y).\end{array}$$
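A minimal NumPy sketch of this vectorized cost (using the example data that appears later in this answer):

```python
import numpy as np

def cost(X, w, y):
    """E(w) = (1/2) (Xw - y)^T (Xw - y), the sum of squared errors."""
    r = X @ w - y
    return 0.5 * (r @ r)

# Example training data; the first column of X is x_0 = 1.
X = np.array([[1., 1.], [1., 3.], [1., 5.], [1., 7.], [1., 9.]])
y = np.array([1., 3., 4., 3., 5.])
print(cost(X, np.zeros(2), y))   # with w = 0 this is (1/2) * sum(y_i^2) = 30.0
```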
Summarizing. To obtain $$\mathbf w=\begin{pmatrix}w_0\\w_1\end{pmatrix},$$
proceed:
1. Set your collections $x$ and $y$:
$$x=\begin{pmatrix}1\\3\\5\\7\\9\end{pmatrix}\qquad y=\begin{pmatrix}1\\3\\4\\3\\5\end{pmatrix}.$$
2. Loop until you think that the $w_j$ converge (try with $10$ iterations) {
$$w_j:=w_j+\eta\sum_{i=1}^m(y^{(i)}-h_w(x^{(i)}))x_j^{(i)}\qquad(\text{for each }j)$$
}
Instead, you can use:
1'. Set your vectors $X$ and $\mathbf y$:
$$X=\begin{pmatrix}1&1\\1&3\\1&5\\1&7\\1&9\end{pmatrix}\qquad\mathbf y=\begin{pmatrix}1\\3\\4\\3\\5\end{pmatrix}.$$
2'. Loop until you think that $\mathbf w$ converges (try with $10$ iterations) {
$$\mathbf w:=\mathbf w-\frac{\eta}{m}X^T(X\mathbf w-\mathbf y)$$
}
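The vectorized loop of step 2' can be sketched in Python/NumPy as follows (the learning rate and iteration count are illustrative choices, not prescribed by the derivation):

```python
import numpy as np

# Example data from this answer; the first column of X is x_0 = 1.
X = np.array([[1., 1.], [1., 3.], [1., 5.], [1., 7.], [1., 9.]])
y = np.array([1., 3., 4., 3., 5.])
m = len(y)

w = np.zeros(2)           # initial guess for (w_0, w_1)
eta = 0.05                # learning rate (an example choice)
for _ in range(1000):     # more iterations than 10, for a tighter fit
    w = w - (eta / m) * (X.T @ (X @ w - y))
print(w)                  # approaches the least-squares fit, approximately [1.2, 0.4]
```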
Note. From the above one can derive an analytical formula for the parameters, using the normal equations to compute $\mathbf w$ as
$$\mathbf w=(X^TX)^{-1}X^T\mathbf y.$$
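A sketch of the normal equations in NumPy (solving the linear system is numerically preferable to forming the explicit inverse):

```python
import numpy as np

X = np.array([[1., 1.], [1., 3.], [1., 5.], [1., 7.], [1., 9.]])
y = np.array([1., 3., 4., 3., 5.])

# w = (X^T X)^{-1} X^T y, computed via a linear solve instead of an inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # approximately [1.2, 0.4]
```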
Obviously you need to use computer software (like Python or Octave) or a calculator (with matrix-vector support) to calculate the values.
I have discussed two ways: using iterations and using vectors. Also, I have skipped some steps concerning your data. If you have any questions, or need additional steps, please ask.
Here I will prove that the loss function below is a convex function.
\begin{equation}
L(\theta, \theta_0) = \sum_{i=1}^N \left( - y^i \log(\sigma(\theta^T x^i + \theta_0))
- (1-y^i) \log(1-\sigma(\theta^T x^i + \theta_0))
\right)
\end{equation}
Then I will show that the loss function below, which the questioner proposed, is NOT a convex function.
\begin{equation}
L(\theta, \theta_0) = \sum_{i=1}^N \left( y^i (1-\sigma(\theta^T x^i + \theta_0))^2
+ (1-y^i) \sigma(\theta^T x^i + \theta_0)^2
\right)
\end{equation}
To prove that solving a logistic regression with the first loss function is solving a convex optimization problem, we need to prove two facts.
$
\newcommand{\reals}{{\mathbf{R}}}
\newcommand{\preals}{{\reals_+}}
\newcommand{\ppreals}{{\reals_{++}}}
$
Suppose that $\sigma: \reals \to \ppreals$ is the sigmoid function defined by
\begin{equation}
\sigma(z) = 1/(1+\exp(-z))
\end{equation}
1. The functions $f_1:\reals\to\reals$ and $f_2:\reals\to\reals$ defined by $f_1(z) = -\log(\sigma(z))$ and $f_2(z) = -\log(1-\sigma(z))$, respectively, are convex functions.
2. A (twice-differentiable) convex function of an affine function is a convex function.
Proof) First, we show that $f_1$ and $f_2$ are convex functions. Since
\begin{eqnarray}
f_1(z) = -\log(1/(1+\exp(-z))) = \log(1+\exp(-z)),
\end{eqnarray}
\begin{eqnarray}
\frac{d}{dz} f_1(z) = -\exp(-z)/(1+\exp(-z)) = -1 + 1/(1+\exp(-z)) = -1 + \sigma(z),
\end{eqnarray}
the derivative of $f_1$ is a monotonically increasing function, so $f_1$ is a (strictly) convex function (see the Wikipedia page on convex functions).
Likewise, since
\begin{eqnarray}
f_2(z) = -\log(\exp(-z)/(1+\exp(-z))) = \log(1+\exp(-z)) +z = f_1(z) + z
\end{eqnarray}
\begin{eqnarray}
\frac{d}{dz} f_2(z) = \frac{d}{dz} f_1(z) + 1.
\end{eqnarray}
Since the derivative of $f_1$ is a monotonically increasing function, that of $f_2$ is also a monotonically increasing function, hence $f_2$ is a (strictly) convex function, hence the proof.
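As a numerical sanity check (not part of the proof): since $f_1'(z) = -1+\sigma(z)$, the second derivative is $f_1''(z)=\sigma(z)(1-\sigma(z))\geq 0$, and $f_2''=f_1''$. A quick NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# f1'(z) = -1 + sigma(z), so f1''(z) = sigma(z)(1 - sigma(z)); f2'' = f1''.
z = np.linspace(-20.0, 20.0, 2001)
second_deriv = sigmoid(z) * (1.0 - sigmoid(z))
print(second_deriv.min() >= 0.0)   # True: f1'' and f2'' are nonnegative everywhere sampled
```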
Now we prove the second claim. Let $f:\reals^m\to\reals$ be a twice-differentiable convex function, $A\in\reals^{m\times n}$, and $b\in\reals^m$, and let $g:\reals^n\to\reals$ be defined by $g(y) = f(Ay + b)$. Then the gradient of $g$ with respect to $y$ is
\begin{equation}
\nabla_y g(y) = A^T \nabla_x f(Ay+b) \in \reals^n,
\end{equation}
and the Hessian of $g$ with respect to $y$ is
\begin{equation}
\nabla_y^2 g(y) = A^T \nabla_x^2 f(Ay+b) A \in \reals^{n \times n}.
\end{equation}
Since $f$ is a convex function, $\nabla_x^2 f(x) \succeq 0$, i.e., it is a positive semidefinite matrix for all $x\in\reals^m$. Then for any $z\in\reals^n$,
\begin{equation}
z^T \nabla_y^2 g(y) z = z^T A^T \nabla_x^2 f(Ay+b) A z
= (Az)^T \nabla_x^2 f(Ay+b) (A z) \geq 0,
\end{equation}
hence $\nabla_y^2 g(y)$ is also a positive semidefinite matrix for all $y\in\reals^n$ (Wiki page for convex function).
Now the objective function to be minimized for logistic regression is
\begin{equation}
\begin{array}{ll}
\mbox{minimize} &
L(\theta) = \sum_{i=1}^N \left( - y^i \log(\sigma(\theta^T x^i + \theta_0))
- (1-y^i) \log(1-\sigma(\theta^T x^i + \theta_0))
\right)
\end{array}
\end{equation}
where $(x^i, y^i)$ for $i=1,\ldots, N$ are $N$ training data. Now this is the sum of convex functions of linear (hence, affine) functions in $(\theta, \theta_0)$. Since the sum of convex functions is a convex function, this problem is a convex optimization.
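As an illustrative numerical check (with made-up data, and a one-dimensional $\theta$ for simplicity), the loss satisfies the midpoint convexity inequality along random segments in $(\theta,\theta_0)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, theta0, x, y):
    z = theta * x + theta0    # one feature per example, for simplicity
    return np.sum(-y * np.log(sigmoid(z)) - (1 - y) * np.log(1 - sigmoid(z)))

# Made-up data: x are 1-D features, y are 0/1 labels.
x = np.array([-1.0, 0.5, 2.0])
y = np.array([0.0, 1.0, 1.0])

# Check midpoint convexity: L((a+b)/2) <= (L(a) + L(b)) / 2 for random a, b.
rng = np.random.default_rng(0)
ok = True
for _ in range(100):
    a, b = rng.normal(size=2), rng.normal(size=2)
    m = (a + b) / 2
    ok = ok and loss(m[0], m[1], x, y) <= (loss(*a, x, y) + loss(*b, x, y)) / 2 + 1e-9
print(ok)   # True
```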
Note that if we maximized the loss function instead of minimizing it, the problem would NOT be a convex optimization problem. So the direction is critical!
Note also that, whether the algorithm we use is stochastic gradient descent, just gradient descent, or any other optimization algorithm, it solves the convex optimization problem, and that even if we use nonconvex nonlinear kernels for feature transformation, it is still a convex optimization problem since the loss function is still a convex function in $(\theta, \theta_0)$.
Now the new loss function proposed by the questioner is
\begin{equation}
L(\theta, \theta_0) = \sum_{i=1}^N \left( y^i (1-\sigma(\theta^T x^i + \theta_0))^2
+ (1-y^i) \sigma(\theta^T x^i + \theta_0)^2
\right)
\end{equation}
First we show that $f(z) = \sigma(z)^2$ is not a convex function in $z$. If we differentiate this function, we have
\begin{equation}
f'(z) = \frac{d}{dz} \sigma(z)^2 = 2 \sigma(z) \frac{d}{dz} \sigma(z)
= 2 \exp(-z) / (1+\exp(-z))^3.
\end{equation}
Since $f'(0)=1/4$ and $\lim_{z\to\infty} f'(z) = 0$ (and $f'$ is differentiable), the mean value theorem implies that there exists $z_0>0$ such that $f''(z_0) < 0$. Therefore $f(z)$ is NOT a convex function.
Now if we let $N=1$, $x^1 = 1$, $y^1 = 0$, $\theta_0=0$, and $\theta\in\reals$, $L(\theta, 0) = \sigma(\theta)^2$, hence $L(\theta,0)$ is not a convex function, hence the proof!
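This non-convexity is easy to confirm numerically: a convex $f$ must satisfy $f((a+b)/2) \leq (f(a)+f(b))/2$, but $f(z)=\sigma(z)^2$ violates it, e.g. at $a=0$, $b=4$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(z):
    return sigmoid(z) ** 2

a, b = 0.0, 4.0
mid = f((a + b) / 2)          # f(2)  ~ 0.776
chord = (f(a) + f(b)) / 2     # ~ 0.607
print(mid > chord)            # True: the midpoint lies above the chord
```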
However, solving a non-convex optimization problem using gradient descent is not necessarily a bad idea. (Almost) every deep learning problem is solved by stochastic gradient descent, because it is essentially the only practical way to solve it (other than evolutionary algorithms).
I hope this is a self-contained (strict) proof for the argument. Please leave feedback if anything is unclear or I made mistakes.
Thank you. - Sunghee
Best Answer
Your first question asks why $x_0$ should be $1$:
Let's look at the hypothesis, $h_\theta(x) = \theta_0 +\theta_1x_1$, where $x_1$ is the number of bedrooms in the house. We can clearly see here that the coefficient of $x_1$ is $\theta_1$. Now, what we could do, is rewrite the hypothesis like this: $h_\theta(x) = \theta_0x_0 +\theta_1x_1$, where $x_0 = 1$, and really, we wouldn't have changed a thing, since we're just multiplying $\theta_0$ by one.
OK, but why do this? Well --- it's just a trick to make our notation simpler. If we define the vector $\mathbf{x} = \begin{pmatrix}x_0\\x_1\end{pmatrix}$ and $\mathbf{\theta} = \begin{pmatrix}\theta_0\\\theta_1\end{pmatrix}$, then we can concisely write the hypothesis as: $h_\theta(x) = \mathbf{\theta^T}\mathbf{x}$. In the case of a 2 dimensional feature vector, this may not seem like a big deal, but once we start moving to higher and higher dimensions, this notation is a lot easier to deal with.
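In code, the trick amounts to prepending a column of ones to the feature matrix so the intercept is absorbed into the parameter vector (the numbers here are hypothetical):

```python
import numpy as np

bedrooms = np.array([2.0, 3.0, 4.0])     # example feature x_1
# Prepend the x_0 = 1 column.
X = np.column_stack([np.ones_like(bedrooms), bedrooms])

theta = np.array([50.0, 25.0])           # hypothetical theta_0, theta_1
h = X @ theta                            # theta_0 + theta_1 * x_1, per row
print(h)                                 # [100. 125. 150.]
```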
Your second question asks about initial values of $\theta$:
I want to clarify the reason behind the cost function and how gradient descent is used; hopefully this will clear up any confusion you have. Look at it like this: the ultimate goal here is to estimate the parameters $\theta_0$ and $\theta_1$. What we need is some way to measure "how good" these estimates are. One way we could do this is by minimizing the difference between the right answers, the $y^{(i)}$s, and our estimated values, the $h_\theta(x^{(i)})$s.
We do this by defining the cost function: $J(\theta) = \frac{1}{2}\sum\limits_{i = 1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$. We wish to minimize this function $J(\theta)$ since it represents the sum of the squared errors our hypothesis has made. In other words, we wish to find values for our $\theta$s which minimize $J(\theta)$. The key thing to realize is that we want to find our $\theta$s.
Gradient descent is simply one way we can do this. I'm not going to go into detail about it, but I will mention that you can start with random $\theta$s. Using gradient descent, we iteratively reach values which minimize $J(\theta)$. There are a bunch of different flavours of gradient descent -- batch, minibatch and stochastic gradient descent, but they essentially do the same thing.
I don't want to confuse you, but gradient descent is simply one method we may use in order to find our parameters, $\mathbf{\theta}$. We could also use the normal equations, for example, to do so.