[Math] Why do we update all the variables simultaneously in Gradient Descent?

gradient-descent, optimization

In the classic gradient descent algorithm, at each iteration step, we update all the variables simultaneously, i.e. $$\pmb{\theta}' \gets \pmb{\theta} - \alpha\,\frac{\partial F}{\partial \pmb{\theta}}$$

An alternative is to update the variables sequentially within each step, using each new value as soon as it becomes available.

For example, at each step: $$\pmb{\theta}_1' \gets \pmb{\theta}_1 - \alpha\,\frac{\partial F(\pmb{\theta}_1, \pmb{\theta}_2)}{\partial \pmb{\theta}_1}$$
$$\pmb{\theta}_2' \gets \pmb{\theta}_2 - \alpha\,\frac{\partial F(\pmb{\theta}_1', \pmb{\theta}_2)}{\partial \pmb{\theta}_2}$$
I'm fairly sure this would also converge to a local optimum, so why is this alternative update scheme usually not the preferred one?
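For concreteness, here is a minimal sketch of both schemes on a toy quadratic objective; the matrix `A`, the starting point, the step size, and the function names are illustrative assumptions of mine, not part of the question:

```python
import numpy as np

# Toy objective for illustration: F(theta) = 0.5 * theta^T A theta, minimized at the origin.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(theta):
    """Gradient of F at theta."""
    return A @ theta

def simultaneous_step(theta, alpha):
    """Classic gradient descent: every coordinate uses the same (old) theta."""
    return theta - alpha * grad(theta)

def sequential_step(theta, alpha):
    """Coordinate-by-coordinate update: each partial derivative sees the
    coordinates that have already been updated within this step."""
    theta = theta.copy()
    for i in range(theta.size):
        theta[i] -= alpha * grad(theta)[i]  # gradient recomputed with updated coordinates
    return theta

theta0 = np.array([1.0, -1.0])
theta_sim, theta_seq = theta0.copy(), theta0.copy()
for _ in range(100):
    theta_sim = simultaneous_step(theta_sim, alpha=0.1)
    theta_seq = sequential_step(theta_seq, alpha=0.1)

print(theta_sim, theta_seq)  # both approach the minimizer at the origin
```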

Edit: sometimes it makes sense not to update simultaneously. One use case is training neural networks in NLP: gradient descent is still used, but without updating from all the training examples at once, because computing the update over the entire training set takes a lot of time. Refer to pg 33 of this pdf.
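The scheme described in this edit is closer to stochastic (mini-batch) gradient descent than to coordinate-wise updates. A minimal sketch under that reading, on an assumed synthetic least-squares problem; the data, batch size of 32, and learning rate are illustrative choices of mine, not from the linked pdf:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # synthetic data, for illustration only
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

def grad_mse(w, Xb, yb):
    """Gradient of mean squared error on the (mini-)batch (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient descent: each update touches every training example.
w_batch = np.zeros(5)
for _ in range(50):
    w_batch -= 0.1 * grad_mse(w_batch, X, y)

# Stochastic (mini-batch) updates: each step is much cheaper, so many more
# steps fit in the same amount of computation.
w_sgd = np.zeros(5)
for _ in range(50):
    idx = rng.choice(len(y), size=32, replace=False)
    w_sgd -= 0.1 * grad_mse(w_sgd, X[idx], y[idx])
```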

Best Answer

As a simple example, let $f = \sin\left(\sum_{i=1}^n \alpha_i \theta_i\right)$. To compute all the partial derivatives at a point, you only have to evaluate the inner sum (and its cosine) once. If you cycle through the variables one at a time, the argument changes after every update, so you have to re-evaluate it $n$ times. Most often it pays off to take steps in all coordinates at the same time. A simple analogy is walking: you typically don't walk in the east-west direction first and then north-south; you walk in the shortest direction, i.e., you move in both coordinates simultaneously.
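A minimal sketch of this cost argument for the answer's example $f$; the coefficient values, starting point, and step size below are arbitrary choices for illustration:

```python
import numpy as np

a = np.array([0.5, -1.0, 2.0])           # the coefficients alpha_i from the answer
theta = np.array([0.1, 0.2, 0.3])
lr = 0.05

def full_gradient(theta):
    """All partial derivatives of f = sin(a . theta) share one evaluation
    of the inner product and one cos call."""
    s = a @ theta                         # evaluated once
    return a * np.cos(s)

# Simultaneous update: one inner-product / cos evaluation per step.
theta_sim = theta - lr * full_gradient(theta)

# Sequential update: the argument a . theta changes after every coordinate,
# so the inner product and cos must be re-evaluated n times per step.
theta_seq = theta.copy()
for i in range(len(theta)):
    s = a @ theta_seq                     # re-evaluated for each coordinate
    theta_seq[i] -= lr * a[i] * np.cos(s)
```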
