First, let's adopt the convention that $x = x_1$, and that $x_1^{(i)}$ denotes the value of $x_1$ in the $i$-th row (see the example), or equivalently, $x_1^{(i)}$ is the value of $x_1$ for the $i$-th training example. We notice that $x^{(1)}_1 = 1.$ We denote by $^{k}\theta_j$ the value of $\theta_j$ after $k$ updates (i.e. after $k$ repetitions of the algorithm). After one update we have:
$$
\ ^{1}\theta_{0}:= \, ^{0}\theta_{0} - \frac 14 \bigg[\left(^{0}\theta_0+\,^{0}\theta_1x_1^{(1)}-y^{(1)}\right)\cdot x_0^{(1)}
+ \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(2)}-y^{(2)}\right)\cdot x_0^{(2)}
+ \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(3)}-y^{(3)}\right)\cdot x_0^{(3)}
+ \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(4)}-y^{(4)}\right)\cdot x_0^{(4)} \bigg]= 0,
$$
since $^{0}\theta_0 = 0,$ $^{0}\theta_1 = 1,$ $x_0^{(i)} = 1,$ and every residual $^{0}\theta_0+\,^{0}\theta_1x_1^{(i)}-y^{(i)} = x_1^{(i)}-y^{(i)}$ vanishes (the training points lie on the line $y = x_1$). Similarly:
$$
\ ^{1}\theta_{1}:= \, ^{0}\theta_{1} - \frac 14 \bigg[\left(^{0}\theta_0+\,^{0}\theta_1x_1^{(1)}-y^{(1)}\right)\cdot x_1^{(1)}
+ \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(2)}-y^{(2)}\right)\cdot x_1^{(2)}
+ \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(3)}-y^{(3)}\right)\cdot x_1^{(3)}
+ \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(4)}-y^{(4)}\right)\cdot x_1^{(4)} \bigg]= 1.
$$
Thus, $^{1}\theta_1 = 1$ and $^{1}\theta_0 = 0.$ Notice that to proceed to the next iteration we need both $\theta_0$ and $\theta_1$ found at the previous step. So:
$$^{2}\theta_1 := \, ^{1}\theta_{1} - \frac 14 \bigg[\left(^{1}\theta_0+\,^{1}\theta_1x_1^{(1)}-y^{(1)}\right)\cdot x_1^{(1)}
+ \left(^{1}\theta_0+\,^{1}\theta_1x_1^{(2)}-y^{(2)}\right)\cdot x_1^{(2)}
+ \left(^{1}\theta_0+\,^{1}\theta_1x_1^{(3)}-y^{(3)}\right)\cdot x_1^{(3)}
+ \left(^{1}\theta_0+\,^{1}\theta_1x_1^{(4)}-y^{(4)}\right)\cdot x_1^{(4)} \bigg]= 1$$
So, no matter how many updates we apply, the value of $\theta_1$ will remain equal to $1$: at every iteration we again have $\theta_0 = 0$ and $\theta_1 = 1,$ so the gradient stays zero and nothing changes. A quick numerical check follows.
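Here is a minimal sketch of this fixed point in Python. The concrete data below are my assumption (any training set whose points lie on the line $y = x_1$ behaves the same way); the point is only that the simultaneous updates leave $\theta$ unchanged.

```python
import numpy as np

# Hypothetical data: four points on the line y = x1, so theta = (0, 1)
# fits perfectly. (The concrete values are an assumption for illustration.)
x1 = np.array([1.0, 2.0, 3.0, 4.0])
y  = x1.copy()                        # y^(i) = x1^(i)
m  = len(y)

theta0, theta1 = 0.0, 1.0             # ^0 theta_0 = 0, ^0 theta_1 = 1
alpha = 1.0                           # so alpha/m = 1/4, as in the updates above

for k in range(10):
    h = theta0 + theta1 * x1          # predictions h_theta(x^(i))
    grad0 = np.mean(h - y)            # (1/m) sum (h - y) * x0, with x0^(i) = 1
    grad1 = np.mean((h - y) * x1)     # (1/m) sum (h - y) * x1
    # simultaneous update of both parameters, as stressed above
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)                 # still 0.0 and 1.0: the gradient is zero
```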
About update 2: here is what I would do if I were in your shoes. First of all, I would calculate $h_\theta(x^{(1)})$ and $h_\theta(x^{(2)})$ separately, where $\theta$ is our initial vector $\theta=[ 1 \quad 3 \quad 2 \quad 1]^T.$
We have:
$$h_\theta(x^{(1)})= 1\cdot 1 + 3\cdot 1 +2\cdot 2 + 1\cdot 1 = 9 $$
$$h_\theta(x^{(2)})= 1\cdot 1 + 3\cdot 4 + 2\cdot 2 + 1\cdot 5 = 22. $$
Thus, if we implement the algorithm, we get:
$$\begin{array}[t]{l}
\theta_0 = 1-\frac{0.5}{2}\cdot \left[(9-1)\cdot 1 +(22-2)\cdot 1 \right]=-6\\
\theta_1 = 3 - \frac{0.5}{2}\cdot\left[(9-1)\cdot 1 + (22-2)\cdot 4\right]=-19\\
\theta_2 = 2 -\frac{0.5}{2}\cdot \left[(9-1)\cdot 2+(22-2)\cdot 2\right]=-12\\
\theta_3 = 1-\frac{0.5}{2}\cdot \left[(9-1)\cdot 1+(22-2)\cdot 5\right]=-26
\end{array}
$$
Thus, after one update the new $\theta=[-6 \quad -19 \quad -12 \quad -26]^T.$
If you want to apply the algorithm once again, evaluate the new $h_\theta(x^{(1)})$ and $h_\theta(x^{(2)})$ and proceed as before.
Notice that the prof uses the convention:
$$\begin{array}[t]{c | c | c | c}
x_0 & x_1 & x_2 & x_3\\
\hline
x_0^{(1)} = 1 & x_1^{(1)} = 1 &x_2^{(1)} = 2 & x_3^{(1)} = 1\\
x_0^{(2)} = 1 & x_1^{(2)}=4 & x_2^{(2)} = 2 & x_3^{(2)} =5
\end{array}
$$
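If you want to check the arithmetic mechanically, here is a minimal sketch in Python (NumPy). The matrix `X` holds the two training examples from the table above, and `y = (1, 2)` is read off from the residuals $(9-1)$ and $(22-2)$ used in the computation:

```python
import numpy as np

X = np.array([[1.0, 1.0, 2.0, 1.0],     # x^(1):  x0, x1, x2, x3
              [1.0, 4.0, 2.0, 5.0]])    # x^(2)
y = np.array([1.0, 2.0])                # y^(1) = 1, y^(2) = 2
theta = np.array([1.0, 3.0, 2.0, 1.0])  # initial theta
alpha, m = 0.5, 2

h = X @ theta                           # h(x^(1)) = 9, h(x^(2)) = 22
theta = theta - (alpha / m) * X.T @ (h - y)
print(theta)                            # [ -6. -19. -12. -26.]
```

For a second update, recompute `h` with the new `theta` and repeat the last assignment, exactly as suggested above.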
A simple example: let $f = \sin\left(\sum_{i=1}^n \alpha_i \theta_i\right)$. Since $\partial f/\partial \theta_i = \alpha_i \cos\left(\sum_{j=1}^n \alpha_j \theta_j\right)$, computing all $n$ derivatives at a point requires evaluating the common argument (and its cosine) only once. If you instead cycle through the variables one at a time, you have to re-evaluate the function $n$ times, because the argument changes after each step. Most often it pays off to take a step in all coordinates at the same time. A simple analogy is walking: you typically don't walk in the east-west direction first and then north-south; you take the shortest path, i.e., move in both coordinates simultaneously. The sketch below illustrates this.
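A sketch of that observation (the coefficients $\alpha_i$ and the point $\theta$ below are arbitrary choices of mine): all $n$ partials come from a single evaluation of the shared argument, whereas the coordinate-wise approach re-evaluates $f$ once per variable.

```python
import numpy as np

alpha = np.array([0.5, -1.0, 2.0])   # the coefficients alpha_i (arbitrary)
theta = np.array([0.1, 0.2, 0.3])    # the point at which we differentiate

# All n partials from ONE evaluation of the shared argument:
# df/dtheta_i = alpha_i * cos(sum_j alpha_j theta_j)
s = alpha @ theta
grad = alpha * np.cos(s)

# Coordinate-wise finite differences re-evaluate f once per variable.
eps = 1e-6
fd = np.array([(np.sin(alpha @ (theta + eps * e)) - np.sin(s)) / eps
               for e in np.eye(len(theta))])
print(np.allclose(grad, fd, atol=1e-4))   # True
```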
Best Answer
The cost function is given by
$$J = \dfrac{1}{N}\sum_{n=1}^{N}\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]^2.$$
Take the total derivative
$$dJ = \dfrac{1}{N}\sum_{n=1}^N\{2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]d\boldsymbol{w}^T\boldsymbol{x}_n \}.$$
As $d\boldsymbol{w}^T$ does not depend on the summation index $n$, we can pull it out of the sum, and we can move it in front of $\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]$ because that factor is a scalar. Hence we obtain
$$dJ = d\boldsymbol{w}^T\left[\dfrac{1}{N}\sum_{n=1}^N\{2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]\boldsymbol{x}_n \}\right].$$
Now, we know that the term in the bracket is the gradient of $J$ with respect to $\boldsymbol{w}$. Hence,
$$\text{grad}_{\boldsymbol{w}}J=\dfrac{1}{N}\sum_{n=1}^N\{2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]\boldsymbol{x}_n \}.$$
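In code this formula vectorizes directly. A minimal sketch, where the names `X`, `y`, `w` are mine and `X` stacks the row vectors $\boldsymbol{x}_n^T$:

```python
import numpy as np

def grad_J(w, X, y):
    """grad_w J = (1/N) * sum_n 2 * (w^T x_n - y_n) * x_n, in one shot."""
    N = len(y)
    return (2.0 / N) * X.T @ (X @ w - y)

# A gradient-descent step would then be:  w -= learning_rate * grad_J(w, X, y)
```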
Here is the explanation of the relationship between the gradient and the total derivative.
Let $J(\boldsymbol{w})=J(w_0,w_1,\ldots,w_m)$ be a multivariate function. The total derivative of $J$ is given by
$$dJ = \dfrac{\partial J}{\partial w_0}dw_0+\dfrac{\partial J}{\partial w_1}dw_1+\ldots+\dfrac{\partial J}{\partial w_m}dw_m$$ $$=[dw_0, dw_1,\ldots, dw_m][\dfrac{\partial J}{\partial w_0},\dfrac{\partial J}{\partial w_1},\ldots,\dfrac{\partial J}{\partial w_m}]^T$$ $$=d\boldsymbol{w}^T\text{grad}_{\boldsymbol{w}}J.$$
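This relationship is easy to verify numerically: for a small perturbation $d\boldsymbol{w}$, the actual change in $J$ should agree with $d\boldsymbol{w}^T\,\text{grad}_{\boldsymbol{w}}J$ to first order. A small self-contained sketch with random data (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # rows are the x_n^T
y = rng.normal(size=50)
w = rng.normal(size=3)

def J(w):
    return np.mean((X @ w - y) ** 2)         # the cost from above

grad = (2 / len(y)) * X.T @ (X @ w - y)      # grad_w J

dw = 1e-6 * rng.normal(size=3)               # a small perturbation dw
dJ = J(w + dw) - J(w)                        # actual change in the cost
print(np.isclose(dJ, dw @ grad, rtol=1e-3))  # True to first order
```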