Gradient Descent: Minimising the Directional Derivative in Direction $\mathbf{u}$


The following excerpt is from chapter 4.3 of Deep Learning, by Goodfellow, Bengio, and Courville:

I don't understand the following:

1. What is meant by $\mathbf{u}, \mathbf{u}^T \mathbf{u} = 1$ in $\min\limits_{\mathbf{u}, \mathbf{u}^T \mathbf{u} = 1}$?
2. Why is $\min\limits_{\mathbf{u}, \mathbf{u}^T \mathbf{u} = 1} \mathbf{u}^T \nabla_{\mathbf{x}} f(\mathbf{x}) = \min\limits_{\mathbf{u}, \mathbf{u}^T \mathbf{u} = 1} ||\mathbf{u}||_2 || \nabla_{\mathbf{x}} f(\mathbf{x})||_2 \cos(\theta)$? I have no idea how the components of the latter expression came about.
3. The authors state that the factors that do not depend on $\mathbf{u}$ are ignored. They then say that the expression simplifies to $\min_{\mathbf{u}} \cos(\theta)$, but doesn't $\cos(\theta)$ depend on $\theta$, not $\mathbf{u}$?
4. I'm not sure that I understand what is meant by the explanation immediately following the above, but this could be due to my not understanding the preceding information.

I would greatly appreciate it if people could please take the time to clarify this.

Best Answer

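1. $\mathbf{u}$ is a unit vector in the direction along which you want to evaluate the slope. That is why the minimisation carries the constraint $\mathbf{u}^T \mathbf{u} = 1$: it restricts the search to vectors of length 1.
2. The expression in your second question is simply the dot product of $\mathbf{u}$ and the gradient vector $\nabla_{\mathbf{x}} f(\mathbf{x})$, which always equals the product of the two lengths and the cosine of the included angle $\theta$.
3. One can ignore the two magnitudes because they are fixed values independent of direction ($||\mathbf{u}||_2 = 1$, and the gradient is fixed at the point $\mathbf{x}$). The angle $\theta$ is the angle between $\mathbf{u}$ and the gradient, so it does depend on $\mathbf{u}$: choosing the direction $\mathbf{u}$ is the same as choosing $\theta$.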
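As a quick numerical sanity check of the points above, here is a minimal Python sketch. The gradient value `(3.0, 4.0)` and the helper name `directional_derivative` are made up for illustration and are not from the book. It sweeps over unit vectors in 2D and confirms that $\mathbf{u}^T \nabla_{\mathbf{x}} f(\mathbf{x})$ is smallest when $\mathbf{u}$ points opposite to the gradient, i.e. when $\cos(\theta) = -1$.

```python
import math

def directional_derivative(u, grad):
    """Return u^T grad, the slope of f along the unit vector u."""
    return sum(ui * gi for ui, gi in zip(u, grad))

# Hypothetical gradient value at some point x (chosen only for illustration).
grad = (3.0, 4.0)
grad_norm = math.hypot(*grad)  # ||grad||_2

# Sweep unit vectors u = (cos t, sin t); u^T u = 1 holds for every angle t.
n_steps = 3600
candidates = [
    (math.cos(2 * math.pi * k / n_steps), math.sin(2 * math.pi * k / n_steps))
    for k in range(n_steps)
]

# Pick the candidate direction with the smallest directional derivative.
best_u = min(candidates, key=lambda u: directional_derivative(u, grad))

# Closed-form minimiser: the unit vector pointing opposite to the gradient.
expected_u = (-grad[0] / grad_norm, -grad[1] / grad_norm)

print("best u from sweep:", best_u)
print("-grad / ||grad||: ", expected_u)
print("minimum slope:    ", directional_derivative(best_u, grad))
```

Up to the resolution of the sweep, the best direction should match $-\nabla f / ||\nabla f||_2$ and the minimum slope should be close to $-||\nabla f||_2$.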

Related Solutions

[Math] Partial derivative in gradient descent for two variables

The answer above is a good one, but I thought I'd add in some more "layman's" terms that helped me better understand concepts of partial derivatives. The answers I've seen here and in the Coursera forums leave out talking about the chain rule, which is important to know if you're going to get what this is doing...

It's helpful for me to think of partial derivatives this way: the variable you're focusing on is treated as a variable, the other terms just numbers. Other key concepts that are helpful:

For "regular derivatives" of a simple form like $F(x) = cx^n$ , the derivative is simply $F'(x) = cn \times x^{n-1}$
The derivative of a constant (a number) is 0.
Summations are just passed on in derivatives; they don't affect the derivative. Just copy them down in place as you derive.

Also, it should be mentioned that the chain rule is being used. The chain rule says that (in clunky layman's terms), for $g(f(x))$, you take the derivative of $g(f(x))$, treating $f(x)$ as the variable, and then multiply by the derivative of $f(x)$. For our cost function, think of it this way:

$$ g(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^2 \tag{1}$$

$$ f(\theta_0, \theta_1)^{(i)} = \theta_0 + \theta_{1}x^{(i)} - y^{(i)} \tag{2}$$

To show I'm not pulling funny business, sub in the definition of $f(\theta_0, \theta_1)^{(i)}$ into the definition of $g(\theta_0, \theta_1)$ and you get:

$$ g(f(\theta_0, \theta_1)^{(i)}) = \frac{1}{2m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right)^2 \tag{3}$$

This is, indeed, our entire cost function.

Thus, the partial derivatives work like this:

$$ \frac{\partial}{\partial \theta_0} g(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^2 = 2 \times \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^{2-1} = \frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)} \tag{4}$$

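In other words, just treat $f(\theta_0, \theta_1)^{(i)}$ like a variable and you have the simple derivative $\frac{d}{dx}\left(\frac{1}{2m} x^2\right) = \frac{1}{m}x$. That is the outer part of the chain rule; the inner part is the derivative of $f(\theta_0, \theta_1)^{(i)}$ itself: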

$$ \frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_0} (\theta_0 + \theta_{1}x^{(i)} - y^{(i)}) \tag{5}$$

And $\theta_1$, $x$, and $y$ are just "a number" since we're taking the derivative with respect to $\theta_0$, so the partial of $f(\theta_0, \theta_1)^{(i)}$ becomes:

$$ \frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_0} (\theta_0 + [a \ number][a \ number]^{(i)} - [a \ number]^{(i)}) = \frac{\partial}{\partial \theta_0} \theta_0 = 1 \tag{6}$$

So, using the chain rule, we have:

$$ \frac{\partial}{\partial \theta_0} g(f(\theta_0, \theta_1)^{(i)}) = \frac{\partial}{\partial \theta_0} g(\theta_0, \theta_1) \frac{\partial}{\partial \theta_0}f(\theta_0, \theta_1)^{(i)} \tag{7}$$

And subbing in the partials of $g(\theta_0, \theta_1)$ and $f(\theta_0, \theta_1)^{(i)}$ from above, we have:

$$ \frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)} \frac{\partial}{\partial \theta_0}f(\theta_0, \theta_1)^{(i)} = \frac{1}{m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right) \times 1 = \frac{1}{m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right) \tag{8}$$

What about the derivative with respect to $\theta_1$?

Our term $g(\theta_0, \theta_1)$ is identical, so we just need to take the derivative of $f(\theta_0, \theta_1)^{(i)}$, this time treating $\theta_1$ as the variable and the other terms as "just a number." That goes like this:

$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_1} (\theta_0 + \theta_{1}x^{(i)} - y^{(i)}) \tag{9}$$

$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_1} ([a \ number] + \theta_{1}[a \ number, x^{(i)}] - [a \ number]) \tag{10}$$

Note that the "just a number", $x^{(i)}$, is important in this case because the derivative of $c \times x$ (where $c$ is some number) is $\frac{d}{dx}(c \times x^1) = c \times 1 \times x^{(1-1=0)} = c \times 1 \times 1 = c$, so the number will carry through. In this case that number is $x^{(i)}$ so we need to keep it. Thus, our derivative is:

$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = 0 + \frac{\partial}{\partial \theta_1}\left((\theta_{1})^1 x^{(i)}\right) - 0 = 1 \times \theta_1^{(1-1=0)} x^{(i)} = 1 \times 1 \times x^{(i)} = x^{(i)} \tag{11}$$

Thus, the entire answer becomes:

$$ \frac{\partial}{\partial \theta_1} g(f(\theta_0, \theta_1)^{(i)}) = \frac{\partial}{\partial \theta_1} g(\theta_0, \theta_1) \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)} \frac{\partial}{\partial \theta_1}f(\theta_0, \theta_1)^{(i)} = \frac{1}{m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right) x^{(i)} \tag{12}$$
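If you want to convince yourself that these two partial derivatives are right, a minimal Python sketch like the one below can help. The data points, the starting values of $\theta_0$ and $\theta_1$, and the function names are all made up for illustration. It compares the formulas derived above against a numerical central-difference estimate of the same derivatives.

```python
def cost(theta0, theta1, xs, ys):
    """Cost function g: (1 / 2m) * sum of squared residuals."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def cost_gradient(theta0, theta1, xs, ys):
    """Partial derivatives of the cost, using the formulas derived above."""
    m = len(xs)
    residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = sum(residuals) / m                               # residual times 1
    d_theta1 = sum(r * x for r, x in zip(residuals, xs)) / m    # residual times x
    return d_theta0, d_theta1

def numerical_gradient(theta0, theta1, xs, ys, h=1e-6):
    """Central-difference estimate of the same two partial derivatives."""
    d_theta0 = (cost(theta0 + h, theta1, xs, ys) - cost(theta0 - h, theta1, xs, ys)) / (2 * h)
    d_theta1 = (cost(theta0, theta1 + h, xs, ys) - cost(theta0, theta1 - h, xs, ys)) / (2 * h)
    return d_theta0, d_theta1

# Hypothetical data and parameter values, chosen only for illustration.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
theta0, theta1 = 0.5, 1.5

analytic = cost_gradient(theta0, theta1, xs, ys)
numeric = numerical_gradient(theta0, theta1, xs, ys)
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two results should agree to several decimal places. Note that the only difference between the two partials is the extra factor $x^{(i)}$ in the $\theta_1$ case.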

A quick addition per @Hugo's comment below. Let's ignore the fact that we're dealing with vectors at all, which drops the summation and the ${}^{(i)}$ superscripts. We can also more easily use real numbers this way.

$\require{cancel}$

Let's say $x = 2$ and $y = 4$.

So, for part 1 you have:

$$\frac{\partial}{\partial \theta_0} (\theta_0 + \theta_{1}x - y)$$

Filling in the values for $x$ and $y$, we have:

$$\frac{\partial}{\partial \theta_0} (\theta_0 + 2\theta_{1} - 4)$$

We only care about $\theta_0$, so $\theta_1$ is treated like a constant (any number, so let's just say it's 6).

$$\frac{\partial}{\partial \theta_0} (\theta_0 + (2 \times 6) - 4) = \frac{\partial}{\partial \theta_0} (\theta_0 + \cancel8) = 1$$

Using the same values, let's look at the $\theta_1$ case (same starting point with $x$ and $y$ values input):

$$\frac{\partial}{\partial \theta_1} (\theta_0 + 2\theta_{1} - 4)$$

In this case we do care about $\theta_1$, but $\theta_0$ is treated as a constant; we'll do the same as above and use 6 for its value:

$$\frac{\partial}{\partial \theta_1} (6 + 2\theta_{1} - 4) = \frac{\partial}{\partial \theta_1} (2\theta_{1} + \cancel2) = 2 = x$$

The answer is 2 because we ended up with $2\theta_1$ and we had that because $x = 2$.

Hopefully this clarifies why in the first instance (wrt $\theta_0$) I wrote "just a number," and in the second case (wrt $\theta_1$) I wrote "just a number, $x^{(i)}$". While it's true that $x^{(i)}$ is still "just a number", it is attached to the variable of interest in the second case, so its value carries through, which is why we end up with $x^{(i)}$ as the result.

Gradient/Steepest Descent: Solving for a Step Size That Makes the Directional Derivative Vanish

First, you're right, "to vanish" means "to be(come) zero".

You seem to be confusing the gradient and the directional derivative.

The gradient $\nabla_\mathbf xf(\mathbf x)$ is a property of the function $f$ at the point $\mathbf x$. (The argument $\mathbf x$ in parentheses specifies the point $\mathbf x$ at which the gradient is taken, whereas the subscript $\mathbf x$ on the nabla operator specifies the variable $\mathbf x$ with respect to which the gradient is taken.)

The directional derivative $\frac{\partial f(\mathbf x)}{\partial \mathbf n}$ is the derivative of the function $f(\mathbf x)$ along the direction specified by a unit vector $\mathbf n$. It's defined by

$$ \frac{\partial f(\mathbf x)}{\partial \mathbf n}=\lim_{\epsilon\to0}\frac{f(\mathbf x+\epsilon\mathbf n)-f(\mathbf x)}\epsilon\;. $$

The connection between the two is that (under suitable differentiability conditions)

$$\frac{\partial f(\mathbf x)}{\partial \mathbf n}=\mathbf n\cdot\nabla_\mathbf xf(\mathbf x)\;.$$

Since the directional derivative is the scalar product of the direction vector and the gradient, the directional derivative is greatest in the direction of the gradient. With the unit vector $\mathbf g=\frac{\nabla_\mathbf xf(\mathbf x)}{\|\nabla_\mathbf xf(\mathbf x)\|}$, we have

$$\frac{\partial f(\mathbf x)}{\partial \mathbf g}=\mathbf g\cdot\nabla_\mathbf xf(\mathbf x)=\frac{\nabla_\mathbf xf(\mathbf x)\cdot\nabla_\mathbf xf(\mathbf x)}{\|\nabla_\mathbf xf(\mathbf x)\|}=\|\nabla_\mathbf xf(\mathbf x)\|\;.$$
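To see these relations in numbers, here is a minimal Python sketch. The function `f`, the point, and the direction are hypothetical choices made only for illustration. It approximates the limit definition with a small but nonzero $\epsilon$ and compares it with $\mathbf n\cdot\nabla_\mathbf xf(\mathbf x)$, then repeats the check along the gradient direction $\mathbf g$.

```python
import math

def f(x, y):
    # Hypothetical example function, chosen only for illustration.
    return x ** 2 + 3 * x * y

def grad_f(x, y):
    # Analytic gradient of the example function above.
    return (2 * x + 3 * y, 3 * x)

def directional_derivative_fd(x, y, n, eps=1e-6):
    # Finite-difference version of the limit definition (eps small, not zero).
    return (f(x + eps * n[0], y + eps * n[1]) - f(x, y)) / eps

point = (1.0, 2.0)
n = (0.6, 0.8)  # a unit vector: 0.6^2 + 0.8^2 = 1

g = grad_f(*point)
from_limit = directional_derivative_fd(*point, n)
from_gradient = n[0] * g[0] + n[1] * g[1]  # n . grad f

# Along the normalised gradient, the directional derivative equals ||grad f||.
g_norm = math.hypot(*g)
g_unit = (g[0] / g_norm, g[1] / g_norm)
along_gradient = directional_derivative_fd(*point, g_unit)

print("limit definition:", from_limit)
print("n . grad f:      ", from_gradient)
print("along gradient:  ", along_gradient, "vs ||grad f|| =", g_norm)
```

The finite-difference values only approximate the limit, so expect agreement to a few decimal places rather than exact equality.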

The text you quote isn't saying that you can choose the step size such as to make the gradient vanish, but rather such as to make the directional derivative vanish. It doesn't say which directional derivative, but it's implied that it's the directional derivative along the search direction. Think of this as setting off in some direction in the mountains and walking downhill straight ahead until you reach a point where you'd start going up again if you continued in that direction – the bottom of the valley, if you will. This isn't the lowest point yet, since you may be able to descend further by changing direction, but you've optimized the height with respect to the direction that you set off in, and at the point that you've reached, the path in that direction is horizontal – even if it leads downhill if you change direction.
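The valley picture can be checked on a small example. The sketch below is only an illustration under assumptions of my own: it uses a quadratic $f(\mathbf x) = \frac12 \mathbf x^T A \mathbf x - \mathbf b^T \mathbf x$ with made-up values for `A`, `b` and the starting point, for which the step size that makes the directional derivative vanish along the search direction $-\nabla f$ has the closed form $\alpha = \frac{\mathbf g\cdot\mathbf g}{\mathbf g\cdot A\mathbf g}$. After the step, the slope along the search direction is zero, but the gradient itself is not.

```python
import math

# Hypothetical quadratic f(x) = 0.5 * x^T A x - b^T x (A symmetric positive definite).
A = ((3.0, 1.0), (1.0, 2.0))
b = (1.0, 1.0)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return tuple(dot(row, v) for row in M)

def gradient(x):
    # For this quadratic, grad f(x) = A x - b.
    return tuple(ai - bi for ai, bi in zip(matvec(A, x), b))

x0 = (2.0, 1.0)
g0 = gradient(x0)
d = tuple(-gi for gi in g0)  # search direction: steepest descent

# Step size that zeroes the directional derivative along d (quadratic case only).
alpha = dot(g0, g0) / dot(g0, matvec(A, g0))
x1 = tuple(xi + alpha * di for xi, di in zip(x0, d))
g1 = gradient(x1)

# Directional derivative at x1 along the unit search direction.
d_norm = math.sqrt(dot(d, d))
n = tuple(di / d_norm for di in d)
slope_along_search = dot(n, g1)
remaining_gradient_norm = math.sqrt(dot(g1, g1))

print("step size alpha:           ", alpha)
print("slope along search dir:    ", slope_along_search)
print("gradient norm at new point:", remaining_gradient_norm)
```

The slope along the search direction comes out as zero up to rounding, while the gradient norm stays clearly positive: you are at the bottom of the valley along your path, but you could still descend by changing direction.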
