The answer above is a good one, but I thought I'd add in some more "layman's" terms that helped me better understand concepts of partial derivatives. The answers I've seen here and in the Coursera forums leave out talking about the chain rule, which is important to know if you're going to get what this is doing...
It's helpful for me to think of partial derivatives this way: the variable you're
focusing on is treated as a variable, and the other terms are treated as just numbers.
Other key concepts that are helpful:
- For "regular derivatives" of a simple form like $F(x) = cx^n$ , the derivative is simply $F'(x) = cn \times x^{n-1}$
- The derivative of a constant (a number) is 0.
- Summations are just passed on in derivatives; they don't affect the derivative. Just copy them down in place as you derive.
Also, it should be mentioned that the chain rule is being used. The chain rule says
that (in clunky layman's terms), for $g(f(x))$, you take the derivative of $g(f(x))$,
treating $f(x)$ as the variable, and then multiply by the derivative of $f(x)$. For
our cost function, think of it this way:
$$ g(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0,
\theta_1)^{(i)}\right)^2 \tag{1}$$
$$ f(\theta_0, \theta_1)^{(i)} = \theta_0 + \theta_{1}x^{(i)} -
y^{(i)} \tag{2}$$
To show I'm not pulling funny business, sub in the definition of $f(\theta_0,
\theta_1)^{(i)}$ into the definition of $g(\theta_0, \theta_1)$ and you get:
$$ g(f(\theta_0, \theta_1)^{(i)}) = \frac{1}{2m} \sum_{i=1}^m \left(\theta_0 +
\theta_{1}x^{(i)} - y^{(i)}\right)^2 \tag{3}$$
This is, indeed, our entire cost function.
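If it helps to see that substitution outside of math notation, here's a minimal Python sketch. The function names (`residual`, `cost`, `cost_expanded`) and the toy data are my own assumptions, not anything from the course; the only point is that composing $g$ with $f$ and writing out equation (3) directly give the same number.

```python
def residual(theta0, theta1, x_i, y_i):
    # Inner function f from equation (2), for a single training example.
    return theta0 + theta1 * x_i - y_i

def cost(theta0, theta1, xs, ys):
    # Outer function g from equation (1): sum the squared f values, divide by 2m.
    m = len(xs)
    return sum(residual(theta0, theta1, x, y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def cost_expanded(theta0, theta1, xs, ys):
    # Equation (3): the same thing with f written out in place.
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Made-up data (an assumption for illustration): y is exactly 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(0.5, 1.5, xs, ys), cost_expanded(0.5, 1.5, xs, ys))
```

Both calls print the same value, and with $\theta_0 = 0$, $\theta_1 = 2$ the cost is zero for this data because every $f^{(i)}$ is zero.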
Thus, the partial derivatives work like this:
$$ \begin{aligned} \frac{\partial}{\partial \theta_0} g(\theta_0, \theta_1) &= \frac{\partial}{\partial
\theta_0} \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^2 \\
&= 2 \times \frac{1}{2m} \sum_{i=1}^m \left(f(\theta_0, \theta_1)^{(i)}\right)^{2-1} \\
&= \frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)} \end{aligned} \tag{4}$$
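In other words, just treat $f(\theta_0, \theta_1)^{(i)}$ like a variable and you have a
simple derivative: the derivative of $\frac{1}{2m} x^2$ is $\frac{1}{m}x$. Next, we need
the partial derivative of the inner function, $f(\theta_0, \theta_1)^{(i)}$: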
$$ \frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_0} (\theta_0 + \theta_{1}x^{(i)} - y^{(i)}) \tag{5}$$
And $\theta_1$, $x$, and $y$ are just "a number" since we're taking the derivative with
respect to $\theta_0$, so the partial of $f(\theta_0, \theta_1)^{(i)}$ becomes:
$$ \frac{\partial}{\partial \theta_0} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_0} (\theta_0 + [a \
number][a \ number]^{(i)} - [a \ number]^{(i)}) = \frac{\partial}{\partial \theta_0}
\theta_0 = 1 \tag{6}$$
So, using the chain rule, we have:
$$ \frac{\partial}{\partial \theta_0} g(f(\theta_0, \theta_1)^{(i)}) =
\frac{\partial}{\partial \theta_0} g(\theta_0, \theta_1) \frac{\partial}{\partial
\theta_0}f(\theta_0, \theta_1)^{(i)} \tag{7}$$
And subbing in the partials of $g(\theta_0, \theta_1)$ and $f(\theta_0, \theta_1)^{(i)}$
from above, we have:
$$ \begin{aligned} \frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)} \frac{\partial}{\partial
\theta_0}f(\theta_0, \theta_1)^{(i)} &= \frac{1}{m} \sum_{i=1}^m \left(\theta_0 +
\theta_{1}x^{(i)} - y^{(i)}\right) \times 1 \\
&= \frac{1}{m} \sum_{i=1}^m \left(\theta_0 + \theta_{1}x^{(i)} - y^{(i)}\right) \end{aligned} \tag{8}$$
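One way to convince yourself the $\theta_0$ result is right is to compare it against a plain numerical slope. This is only a sketch under my own assumptions: the function names, the toy data, and the step size `h` are all made up for illustration.

```python
def cost(theta0, theta1, xs, ys):
    # The cost function from equation (3).
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def d_cost_d_theta0(theta0, theta1, xs, ys):
    # The formula derived above: (1/m) * sum of (theta0 + theta1 * x - y).
    m = len(xs)
    return sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m

def numeric_d_theta0(theta0, theta1, xs, ys, h=1e-6):
    # Nudge theta0 a little in both directions, hold theta1 fixed,
    # and measure the slope of the cost between the two nudged points.
    up = cost(theta0 + h, theta1, xs, ys)
    down = cost(theta0 - h, theta1, xs, ys)
    return (up - down) / (2 * h)

# Made-up data (an assumption for illustration).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

analytic = d_cost_d_theta0(0.5, 1.5, xs, ys)
numeric = numeric_d_theta0(0.5, 1.5, xs, ys)
print(analytic, numeric)
```

For this data the three $f^{(i)}$ values are $0$, $-0.5$ and $-1$, so the formula gives $-1.5 / 3 = -0.5$, and the numerical slope agrees up to rounding.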
What about the derivative with respect to $\theta_1$?
Our term $g(\theta_0, \theta_1)$ is identical, so we just need to take the derivative
of $f(\theta_0, \theta_1)^{(i)}$, this time treating $\theta_1$ as the variable and the
other terms as "just a number." That goes like this:
$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_1} (\theta_0 + \theta_{1}x^{(i)} - y^{(i)}) \tag{9}$$
$$ \frac{\partial}{\partial
\theta_1} f(\theta_0, \theta_1)^{(i)} = \frac{\partial}{\partial \theta_1} ([a \ number] +
\theta_{1}[a \ number, x^{(i)}] - [a \ number]) \tag{10}$$
Note that the "just a number", $x^{(i)}$, is important in this case because the
derivative of $c \times x$ (where $c$ is some number) is $\frac{d}{dx}(c \times x^1) =
c \times 1 \times x^{(1-1=0)} = c \times 1 \times 1 = c$, so the number will carry
through. In this case that number is $x^{(i)}$ so we need to keep it. Thus, our
derivative is:
$$ \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1)^{(i)} = 0 +
\frac{\partial}{\partial \theta_1} \left((\theta_{1})^1 x^{(i)}\right) - 0 = 1 \times
\theta_1^{(1-1=0)} x^{(i)} = 1 \times 1 \times x^{(i)} = x^{(i)} \tag{11}$$
Thus, the entire answer becomes:
$$ \begin{aligned} \frac{\partial}{\partial \theta_1} g(f(\theta_0, \theta_1)^{(i)}) &=
\frac{\partial}{\partial \theta_1} g(\theta_0, \theta_1) \frac{\partial}{\partial
\theta_1} f(\theta_0, \theta_1)^{(i)} \\
&= \frac{1}{m} \sum_{i=1}^m f(\theta_0, \theta_1)^{(i)} \frac{\partial}{\partial
\theta_1}f(\theta_0, \theta_1)^{(i)} \\
&= \frac{1}{m} \sum_{i=1}^m \left(\theta_0 +
\theta_{1}x^{(i)} - y^{(i)}\right) x^{(i)} \end{aligned} \tag{12}$$
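To make the "same outer part, different inner derivative" idea concrete, here's a rough Python sketch. The helper `partial_via_chain_rule` and its `inner_partial` argument are names I made up for illustration, and the data is a toy example, so treat this as one possible way to lay it out rather than the way.

```python
def residual(theta0, theta1, x_i, y_i):
    # Inner function f for one training example.
    return theta0 + theta1 * x_i - y_i

def partial_via_chain_rule(theta0, theta1, xs, ys, inner_partial):
    # The outer piece, (1/m) * sum of f, is shared by both partials.
    # inner_partial(x_i) is the derivative of f with respect to the
    # parameter we care about, and is the only thing that changes.
    m = len(xs)
    return sum(
        residual(theta0, theta1, x, y) * inner_partial(x)
        for x, y in zip(xs, ys)
    ) / m

def d_theta0(theta0, theta1, xs, ys):
    # With respect to theta0 the inner derivative is 1.
    return partial_via_chain_rule(theta0, theta1, xs, ys, lambda x_i: 1.0)

def d_theta1(theta0, theta1, xs, ys):
    # With respect to theta1 the inner derivative is x_i, which carries through.
    return partial_via_chain_rule(theta0, theta1, xs, ys, lambda x_i: x_i)

# Made-up data (an assumption for illustration).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(d_theta0(0.5, 1.5, xs, ys), d_theta1(0.5, 1.5, xs, ys))
```

Here the $f^{(i)}$ values are $0$, $-0.5$ and $-1$, so the $\theta_0$ partial is $-0.5$ and the $\theta_1$ partial is $(0 \times 1 - 0.5 \times 2 - 1 \times 3)/3 = -4/3$.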
A quick addition per @Hugo's comment below. Let's ignore the fact that we're dealing with vectors at all, which drops the summation and the $(i)$ superscripts. We can also more easily use real numbers this way.
$\require{cancel}$
Let's say $x = 2$ and $y = 4$.
So, for part 1 you have:
$$\frac{\partial}{\partial \theta_0} (\theta_0 + \theta_{1}x - y)$$
Filling in the values for $x$ and $y$, we have:
$$\frac{\partial}{\partial \theta_0} (\theta_0 + 2\theta_{1} - 4)$$
We only care about $\theta_0$, so $\theta_1$ is treated like a constant (any number, so let's just say it's 6).
$$\frac{\partial}{\partial \theta_0} (\theta_0 + (2 \times 6) - 4) = \frac{\partial}{\partial \theta_0} (\theta_0 + \cancel8) = 1$$
Using the same values, let's look at the $\theta_1$ case (same starting point with $x$ and $y$ values input):
$$\frac{\partial}{\partial \theta_1} (\theta_0 + 2\theta_{1} - 4)$$
In this case we do care about $\theta_1$, but $\theta_0$ is treated as a constant; we'll do the same as above and use 6 for its value:
$$\frac{\partial}{\partial \theta_1} (6 + 2\theta_{1} - 4) = \frac{\partial}{\partial \theta_1} (2\theta_{1} + \cancel2) = 2 = x$$
The answer is 2 because we ended up with $2\theta_1$ and we had that because $x = 2$.
Hopefully this clarifies a bit why in the first instance (wrt $\theta_0$) I wrote "just a number," and in the second case (wrt $\theta_1$) I wrote "just a number, $x^{(i)}$." While it's true that $x^{(i)}$ is still "just a number", since it's attached to the variable of interest in the second case its value will carry through, which is why we end up at $x^{(i)}$ for the result.
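The same numbers can be checked by brute force. In this sketch (names, step size, and the starting point 3.0 are my own choices, purely for illustration) we hold one parameter at 6, nudge the other, and see how fast $\theta_0 + 2\theta_1 - 4$ changes.

```python
def f(theta0, theta1, x=2.0, y=4.0):
    # The single-example expression with x = 2 and y = 4 filled in.
    return theta0 + theta1 * x - y

def slope(func, value, h=1e-6):
    # Slope of a one-variable function around `value`, by nudging it both ways.
    return (func(value + h) - func(value - h)) / (2 * h)

# Vary theta0 while theta1 is frozen at 6 ("just a number").
d_theta0 = slope(lambda t0: f(t0, 6.0), 3.0)

# Vary theta1 while theta0 is frozen at 6.
d_theta1 = slope(lambda t1: f(6.0, t1), 3.0)

print(d_theta0, d_theta1)
```

Up to floating-point rounding, the first slope comes out as 1 and the second as 2, which is the value of $x$. Picking a different starting point than 3.0 doesn't change either slope, since the expression is linear in both parameters.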
Best Answer
For the most general case, think about a mixing board.
Each input argument to the function is represented by a slider with an associated piece of a real number line along one side, just like in the picture. If you are thinking of a function which can accept arbitrary real number inputs, the slider will have to be infinitely long, which is not possible in real life but is in the imaginary, ideal world of mathematics. This mixing board also has a dial on it, which displays the number corresponding to the function's output.
The partial derivative of the function with respect to one of its input arguments corresponds to how sensitive the readout on the dial is if you wiggle the slider representing that argument just a little bit around wherever it's currently set - that is, how much more or less dramatic the changes in what is shown are compared to the size of your wiggle. If you wiggle a slider by, say, 0.0001, and the value on the dial changes by 0.0002, the partial derivative with respect to that variable at the given setting is (likely only approximately) 2. If the value changes in the opposite sense, i.e. goes down when you move the slider up, the derivative is negative.
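As a rough illustration of the wiggle idea, here is a small Python sketch. The `dial` function is a made-up readout and `sensitivity` is a hypothetical helper name; neither comes from anywhere in particular.

```python
def sensitivity(readout, sliders, index, wiggle=0.0001):
    # Move one slider up by `wiggle`, leave the others alone, and compare
    # the change on the dial to the size of the wiggle.
    nudged = list(sliders)
    nudged[index] += wiggle
    return (readout(nudged) - readout(sliders)) / wiggle

def dial(sliders):
    # A made-up three-slider mixing board (an assumption for illustration).
    return sliders[0] ** 2 + 3 * sliders[1] - sliders[2]

setting = [1.0, 5.0, 2.0]

s0 = sensitivity(dial, setting, 0)  # close to 2, but not exactly
s1 = sensitivity(dial, setting, 1)  # close to 3
s2 = sensitivity(dial, setting, 2)  # negative: the dial goes down as the slider goes up
print(s0, s1, s2)
```

The first slider is the "wiggle by 0.0001, dial moves by about 0.0002" situation: the ratio is only approximately 2 because the wiggle is small but not ideally small.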
The gradient, then, is the ordered list of signed proportions by which you have to "wiggle" all the sliders so as to achieve the strongest possible, but still small, positive wiggle in the value on the dial. This is a vector, because you can think of vectors as ordered lists of quantities that we can subject to elementwise addition and elementwise multiplication by a single number.
And of course, when I say "small" here I mean "ideally small" - i.e. "just on the cusp of being zero" which, of course, you can make formally rigorous in a number of ways, such as by using limits.
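To see the "strongest possible positive wiggle" claim in action, here is a hedged Python sketch. The two-slider `dial`, the helper names, and the handful of comparison directions are all assumptions chosen for illustration; it only checks the gradient against those few directions and is not a proof.

```python
import math

def dial(sliders):
    # A made-up two-slider readout (an assumption for illustration).
    return sliders[0] ** 2 + 3 * sliders[1]

def gradient(readout, sliders, h=1e-6):
    # One sensitivity per slider, collected into an ordered list.
    grad = []
    for i in range(len(sliders)):
        up = list(sliders)
        up[i] += h
        down = list(sliders)
        down[i] -= h
        grad.append((readout(up) - readout(down)) / (2 * h))
    return grad

def change_along(readout, sliders, direction, step=1e-3):
    # Move all sliders together by a total distance `step` in the given
    # direction and report how much the dial changes.
    length = math.sqrt(sum(d * d for d in direction))
    moved = [s + step * d / length for s, d in zip(sliders, direction)]
    return readout(moved) - readout(sliders)

setting = [1.0, 5.0]
grad = gradient(dial, setting)

best = change_along(dial, setting, grad)
others = [
    change_along(dial, setting, direction)
    for direction in ([1, 0], [0, 1], [1, 1], [-1, 0])
]
print(grad, best, others)
```

At this setting the gradient is approximately $(2, 3)$, and a small move in those proportions raises the dial more than an equally sized move in any of the other directions tried.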