We won't learn derivatives in school for the next 3 or 4 years, but I got interested in math and started learning about derivatives on Khan Academy. I understand what the derivative of $f(x) = x^2$ is, but what if $f(x, y) = (x + y)(x + y)$? I guess I would need a three-dimensional graph for that, so I wouldn't be looking for a tangent line but for a tangent surface of some sort. How can I express the "slope" of a surface? I guess that I need at least two numbers… I did some calculations that I'm pretty sure are wrong, and I got $f'(x, y) = 2x + 2y$ for $f(x, y) = (x + y)(x + y)$… This doesn't really make any sense to me at all, so how do you do it?
[Math] How to find the derivative of a function that takes in two variables
calculus multivariable-calculus
Related Solutions
The answer above is a good one, but I thought I'd add in some more "layman's" terms that helped me better understand concepts of partial derivatives. The answers I've seen here and in the Coursera forums leave out talking about the chain rule, which is important to know if you're going to get what this is doing...
It's helpful for me to think of partial derivatives this way: the variable you're focusing on is treated as a variable, and the other terms are treated as just numbers. Other key concepts that are helpful:
- For "regular derivatives" of a simple form like $F(x) = cx^n$ , the derivative is simply $F'(x) = cn \times x^{n-1}$
- The derivative of a constant (a number) is 0.
- Summations are just passed on in derivatives; they don't affect the derivative. Just copy them down in place as you differentiate.
Also, it should be mentioned that the chain rule is being used. The chain rule says that (in clunky layman's terms), for $g(f(x))$, you take the derivative of $g(f(x))$, treating $f(x)$ as the variable, and then multiply by the derivative of $f(x)$. For our cost function, think of it this way:
$$ g(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^2 \tag{1}$$
$$ f(\theta_0, \theta_1)^{(i)} = \theta_0 + \theta_{1}x^{(i)} - y^{(i)} \tag{2}$$
To show I'm not pulling funny business, sub in the definition of $f(\theta_0, \theta_1)^{(i)}$ into the definition of $g(\theta_0, \theta_1)$ and you get:
$$ g(f(\theta_0, \theta_1)^{(i)}) = \frac{1}{2m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right)^2 \tag{3}$$
This is, indeed, our entire cost function.
Thus, the partial derivatives work like this:
$$ \frac{\partial}{\partial \theta_0} g(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^2 = 2 \times \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^{2-1} = \tag{4}$$
$$\frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)}$$
In other words, just treat $f(\theta_0, \theta_1)^{(i)}$ like a variable $x$ and you have the simple derivative of $\frac{1}{2m} x^2$, which is $\frac{1}{m}x$. Next, we need the derivative of the inner function:
$$ \frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_0} (\theta_0 + \theta_{1}x^{(i)} - y^{(i)}) \tag{5}$$
And $\theta_1$, $x^{(i)}$, and $y^{(i)}$ are each just "a number" since we're taking the derivative with respect to $\theta_0$, so the partial of $f(\theta_0, \theta_1)^{(i)}$ becomes:
$$ \frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_0} (\theta_0 + [a \ number][a \ number]^{(i)} - [a \ number]^{(i)}) = \frac{\partial}{\partial \theta_0} \theta_0 = 1 \tag{6}$$
So, using the chain rule, we have:
$$ \frac{\partial}{\partial \theta_0} g(f(\theta_0, \theta_1)^{(i)}) = \frac{\partial}{\partial \theta_0} g(\theta_0, \theta_1) \frac{\partial}{\partial \theta_0}f(\theta_0, \theta_1)^{(i)} \tag{7}$$
And subbing in the partials of $g(\theta_0, \theta_1)$ and $f(\theta_0, \theta_1)^{(i)}$ from above, we have:
$$ \frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)} \frac{\partial}{\partial \theta_0}f(\theta_0, \theta_1)^{(i)} = \frac{1}{m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right) \times 1 = \tag{8}$$
$$ \frac{1}{m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right)$$
What about the derivative with respect to $\theta_1$?
Our term $g(\theta_0, \theta_1)$ is identical, so we just need to take the derivative of $f(\theta_0, \theta_1)^{(i)}$, this time treating $\theta_1$ as the variable and the other terms as "just a number." That goes like this:
$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_1} (\theta_0 + \theta_{1}x^{(i)} - y^{(i)}) \tag{9}$$
$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_1} ([a \ number] + \theta_{1}[a \ number, x^{(i)}] - [a \ number]) \tag{10}$$
Note that the "just a number", $x^{(i)}$, is important in this case because the derivative of $c \times x$ (where $c$ is some number) is $\frac{d}{dx}(c \times x^1) = c \times 1 \times x^{(1-1=0)} = c \times 1 \times 1 = c$, so the number will carry through. In this case that number is $x^{(i)}$ so we need to keep it. Thus, our derivative is:
$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = 0 + \frac{\partial}{\partial \theta_1}\left((\theta_{1})^1 x^{(i)}\right) - 0 = 1 \times \theta_1^{(1-1=0)} x^{(i)} = 1 \times 1 \times x^{(i)} = x^{(i)} \tag{11}$$
Thus, the entire answer becomes:
$$ \frac{\partial}{\partial \theta_1} g(f(\theta_0, \theta_1)^{(i)}) = \frac{\partial}{\partial \theta_1} g(\theta_0, \theta_1) \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \tag{12}$$
$$\frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)} \frac{\partial}{\partial \theta_1}f(\theta_0, \theta_1)^{(i)} = \frac{1}{m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right) x^{(i)}$$
A quick addition per @Hugo's comment below. Let's ignore the fact that we're dealing with vectors at all, which drops the summation and the ${}^{(i)}$ superscripts. We can also more easily use real numbers this way.
$\require{cancel}$
Let's say $x = 2$ and $y = 4$.
So, for part 1 you have:
$$\frac{\partial}{\partial \theta_0} (\theta_0 + \theta_{1}x - y)$$
Filling in the values for $x$ and $y$, we have:
$$\frac{\partial}{\partial \theta_0} (\theta_0 + 2\theta_{1} - 4)$$
We only care about $\theta_0$, so $\theta_1$ is treated like a constant (any number, so let's just say it's 6).
$$\frac{\partial}{\partial \theta_0} (\theta_0 + (2 \times 6) - 4) = \frac{\partial}{\partial \theta_0} (\theta_0 + \cancel8) = 1$$
Using the same values, let's look at the $\theta_1$ case (same starting point with $x$ and $y$ values input):
$$\frac{\partial}{\partial \theta_1} (\theta_0 + 2\theta_{1} - 4)$$
In this case we do care about $\theta_1$, but $\theta_0$ is treated as a constant; we'll do the same as above and use 6 for its value:
$$\frac{\partial}{\partial \theta_1} (6 + 2\theta_{1} - 4) = \frac{\partial}{\partial \theta_1} (2\theta_{1} + \cancel2) = 2 = x$$
The answer is 2 because we ended up with $2\theta_1$ and we had that because $x = 2$.
Hopefully this clarifies a bit why in the first instance (wrt $\theta_0$) I wrote "just a number," and in the second case (wrt $\theta_1$) I wrote "just a number, $x^{(i)}$." While it's true that $x^{(i)}$ is still "just a number", since it's attached to the variable of interest in the second case its value will carry through, which is why we end up at $x^{(i)}$ for the result.
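The same numbers can be checked mechanically. This is a minimal sketch, assuming the values used above ($x = 2$, $y = 4$, and 6 for whichever parameter is held constant); the helper name `partial` is invented for this example.

```python
def f(theta0, theta1, x=2.0, y=4.0):
    """theta0 + theta1 * x - y with x = 2 and y = 4 filled in."""
    return theta0 + theta1 * x - y


def partial(func, args, index, eps=1e-6):
    """Central-difference estimate of the partial w.r.t. args[index]."""
    up = list(args)
    down = list(args)
    up[index] += eps
    down[index] -= eps
    return (func(*up) - func(*down)) / (2 * eps)


point = (6.0, 6.0)  # (theta0, theta1); 6 is the arbitrary constant used above
d_theta0 = partial(f, point, 0)
d_theta1 = partial(f, point, 1)
print(d_theta0, d_theta1)
```

The first estimate should come out close to 1 and the second close to 2, i.e. the value of $x$, whatever number is picked for the parameter held constant.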
Your given function $f(x)=\frac{1}{x^2}$ is not differentiable at 0: it is undefined there, so it isn't even continuous at that point. It approaches infinity from either side, so the gradient grows without bound in magnitude on either side, but AT 0 it does not exist. For (3), I would think a better way of saying this is a 'kink'; for example, take $f(x)=|x|$. This is not differentiable at 0 since approaching from $0^+$ we have gradient 1, and $-1$ from $0^-$. Or you can think about drawing lots of tangents at the kink: they all look like tangents but are all different lines, so the function is not differentiable there.
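The kink can be seen numerically by computing the slope from each side separately. The sketch below is an illustration only; the helper `one_sided_slopes` and the step size are choices made for this example, and $x^2$ is included as a smooth function for contrast.

```python
def one_sided_slopes(func, x0, h=1e-6):
    """Return (slope from the left, slope from the right) at x0."""
    left = (func(x0) - func(x0 - h)) / h
    right = (func(x0 + h) - func(x0)) / h
    return left, right


def square(x):
    return x * x


kink_left, kink_right = one_sided_slopes(abs, 0.0)
smooth_left, smooth_right = one_sided_slopes(square, 0.0)
print("|x| at 0:", kink_left, kink_right)      # the two sides disagree
print("x^2 at 0:", smooth_left, smooth_right)  # both sides are near 0
```

For $|x|$ the two one-sided slopes stay near $-1$ and $1$ no matter how small `h` is made, so there is no single slope to call the derivative; for $x^2$ both shrink toward the same value.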
Best Answer
The usual thing one does (but this can vary according to what you're trying to achieve) is to compute a vector called the gradient of the function, notated $\nabla f$. The gradient has one component for each input to $f$; for each component you treat all other variables temporarily as constants and differentiate with respect to the chosen one: $$(\nabla f)(x,y) = \left(\frac{d}{dx}f(x,y), \frac{d}{dy}f(x,y)\right) = ( 2(x+y), 2(x+y) ) $$ In this example it may be a bit hard to see what's going on, because both components end up being the same, so let's try another one: $g(x,y)=x^2y+y$ gives $$\nabla g(x,y) = (2xy, x^2+1)$$
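As a quick check of the first example under the same rule (hold the other variable constant), one can expand the product first and differentiate term by term:

$$f(x,y) = (x+y)(x+y) = x^2 + 2xy + y^2$$

$$\frac{d}{dx}f(x,y) = 2x + 2y + 0 = 2(x+y), \qquad \frac{d}{dy}f(x,y) = 0 + 2x + 2y = 2(x+y)$$

So the expression $2x + 2y$ from the question is right for each component; what was missing is that there are two such components, one per variable.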
The intuitive property of the gradient is that if we have some small offset $(h,k)$ from $(x,y)$, then the difference $f(x+h,y+k)-f(x,y)$ is closely approximated by the dot product between $(h,k)$ and $\nabla f(x,y)$.
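This approximation property can be tried out with concrete numbers. The following is a minimal sketch for $f(x,y) = (x+y)(x+y)$; the base point $(1, 2)$ and the offset $(0.01, -0.02)$ are arbitrary choices for illustration.

```python
def f(x, y):
    return (x + y) * (x + y)


def grad_f(x, y):
    """Gradient of f: one partial derivative per input."""
    return (2 * (x + y), 2 * (x + y))


x, y = 1.0, 2.0      # base point (arbitrary)
h, k = 0.01, -0.02   # small offset (arbitrary)

actual = f(x + h, y + k) - f(x, y)
gx, gy = grad_f(x, y)
estimate = h * gx + k * gy  # dot product of (h, k) with the gradient

print("actual change:   ", actual)
print("gradient estimate:", estimate)
```

The two numbers differ only by $(h+k)^2$ here, which shrinks much faster than the offset itself as $(h,k)$ gets smaller.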
The components of the gradient are called partial derivatives and are often notated with a special sign as $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$. The $\partial$ sign looks a bit like a $d$, but is different as a reminder that a single partial derivative is not usually enough to estimate $\Delta f$; you need both of them.