[Math] Partial derivative in gradient descent for logistic regression

Tags: gradient descent, partial derivative

I'm posting this question here because I think it is a calculus problem.

I'm a software engineer, and I have just started Udacity's Deep Learning Nanodegree.

I have also worked my way through Stanford professor Andrew Ng's online course on machine learning, and now I'm comparing the two.

I have a big doubt about gradient descent with the sigmoid function, because the formula in Andrew Ng's course is different from the one I see in Udacity's nanodegree.

From Andrew Ng's course, gradient descent is (first formula):

$$\theta_j := \theta_j - \alpha \sum_{i=1}^{m}\left(h_\theta\big(x^{(i)}\big) - y^{(i)}\right)x_j^{(i)}$$

But from Udacity's nanodegree it is (second formula):

$$\Delta w_j = \eta\,(y - \hat y)\,f'(h)\,x_j, \qquad \hat y = f(h),\quad h = \sum_j w_j x_j$$

Note: the first picture is from this video, and the second picture is from this other video.

But in these CS229 course notes from Andrew Ng, on page 18, I found the derivation of Andrew Ng's gradient ascent formula. I only include it here because I haven't found the corresponding derivation for gradient descent and I don't know how to do it myself:

$$
\begin{aligned}
\frac{\partial}{\partial\theta_j}\ell(\theta)
&= \left(\frac{y}{g(\theta^T x)} - \frac{1-y}{1-g(\theta^T x)}\right)\frac{\partial}{\partial\theta_j}\, g(\theta^T x)\\
&= \left(\frac{y}{g(\theta^T x)} - \frac{1-y}{1-g(\theta^T x)}\right) g(\theta^T x)\big(1-g(\theta^T x)\big)\,\frac{\partial}{\partial\theta_j}\,\theta^T x\\
&= \Big(y\,\big(1-g(\theta^T x)\big) - (1-y)\,g(\theta^T x)\Big)\,x_j\\
&= \big(y - h_\theta(x)\big)\,x_j
\end{aligned}
$$

Note: the derivation above is for gradient ascent.

I'm not sure I have understood everything, but in this derivation I see that the derivative of the f function disappears (the f function being the sigmoid function).

But in Udacity's nanodegree they continue using the sigmoid's derivative in their gradient descent.

The difference between the first formula and the second formula is that derivative term.

Are the two formulas equivalent?
Where can I find all the steps of that partial derivative?

Best Answer

Please take a look at this part of the Machine Learning course on Coursera, which may help with your question: https://www.coursera.org/learn/machine-learning/lecture/MtEaZ/simplified-cost-function-and-gradient-descent In that lecture, the instructor shows the result of the derivative used in gradient descent for logistic regression.
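
That lecture only states the result, so here is a sketch of where the difference between your two formulas comes from, assuming (as the two courses respectively do) that the first formula minimizes the cross-entropy cost and the second minimizes a squared-error cost, with $g(z) = \frac{1}{1+e^{-z}}$ and $g'(z) = g(z)\big(1-g(z)\big)$.

For the cross-entropy cost, with $h_\theta(x) = g(\theta^T x)$:

$$
\begin{aligned}
J(\theta) &= -\big[\,y\log h_\theta(x) + (1-y)\log\big(1-h_\theta(x)\big)\,\big],\\
\frac{\partial J}{\partial \theta_j}
&= -\left(\frac{y}{h_\theta(x)} - \frac{1-y}{1-h_\theta(x)}\right) h_\theta(x)\big(1-h_\theta(x)\big)\,x_j
= \big(h_\theta(x) - y\big)\,x_j,
\end{aligned}
$$

so the sigmoid derivative cancels and the update $\theta_j := \theta_j - \alpha\,\big(h_\theta(x)-y\big)\,x_j$ has no explicit $f'$ term.

For the squared-error cost, with $\hat y = f(h)$ and $h = \sum_j w_j x_j$:

$$
E = \tfrac12\,(y-\hat y)^2,
\qquad
\frac{\partial E}{\partial w_j} = -(y-\hat y)\,f'(h)\,x_j,
$$

so the descent step $\Delta w_j = \eta\,(y-\hat y)\,f'(h)\,x_j$ keeps the $f'(h)$ factor. The two update rules are therefore not algebraic rearrangements of each other; they are gradients of different cost functions, and the sigmoid derivative only cancels for the cross-entropy cost.

If it helps, a minimal numerical check (made-up example values, Python with NumPy) confirms that each formula is the exact gradient of its own cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single training example and weights, just for the check.
x = np.array([0.5, -1.2, 2.0])
y = 1.0
w = np.array([0.1, 0.4, -0.3])

h = w @ x
y_hat = sigmoid(h)

# Analytic gradients from the two derivations above.
grad_cross_entropy = (y_hat - y) * x                          # no f'(h) factor
grad_squared_error = -(y - y_hat) * y_hat * (1 - y_hat) * x   # keeps f'(h) = y_hat*(1-y_hat)

# Central-difference numerical gradients, to compare against.
def num_grad(cost, w, eps=1e-6):
    g = np.zeros_like(w)
    for j in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[j] += eps
        wm[j] -= eps
        g[j] = (cost(wp) - cost(wm)) / (2 * eps)
    return g

cross_entropy = lambda w: -(y * np.log(sigmoid(w @ x)) + (1 - y) * np.log(1 - sigmoid(w @ x)))
squared_error = lambda w: 0.5 * (y - sigmoid(w @ x)) ** 2

print(np.allclose(grad_cross_entropy, num_grad(cross_entropy, w)))  # True
print(np.allclose(grad_squared_error, num_grad(squared_error, w)))  # True
```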