At that point I was wondering: why wouldn't we just find the zeros of the partial derivatives of $J(\theta)$?
Normally you cannot, because the resulting equations are too complicated to solve in closed form: in modern neural networks, for example, $J$ is extremely non-linear, non-convex, costly to evaluate, and complicated (even including control flow these days), and $\theta$ has tens or hundreds of millions of dimensions.
but then the normal equation, $\theta=(X^TX)^{-1}X^Ty$, was introduced, which supposedly has something to do with it
Yes, linear regression is a special case where you can analytically solve for the zeros of the partial derivatives of $J$.
Let $X\in\mathbb{R}^{m\times n}$ (with $x_i\in\mathbb{R}^n$ the $i$th example, i.e. the $i$th row of $X$ written as a column vector), $Y\in\mathbb{R}^m$, $\theta\in\mathbb{R}^n$, and $H=X\theta\in\mathbb{R}^m$ (so that $h_i=h_\theta(x_i)=\theta^Tx_i$). Then we may write
$$
J(\theta) = \frac{1}{2m}||H - Y||_2^2 = \frac{1}{2m}||X\theta - Y||_2^2 = \frac{1}{2m}\sum_i (h_\theta(x_i) - y_i)^2 = \frac{1}{2m}\sum_i (\theta^Tx_i - y_i)^2
$$
where $n$ is the number of features and $m$ is the number of examples.
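For concreteness, this cost is a one-liner in numpy (a minimal sketch, with `X`, `Y`, and `theta` as defined above):

```python
import numpy as np

def J(theta, X, Y):
    """Least-squares cost with the conventional 1/(2m) factor."""
    m = X.shape[0]                                # number of examples
    return np.sum((X @ theta - Y) ** 2) / (2 * m)
```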
Notice that in the ideal case we would have $X\theta = Y$, and we could just solve it as a linear system. But $X$ is usually not square, of course: typically $m > n$, so the system is overdetermined and has no exact solution.
The normal equations are usually written $$ X^TX\theta = X^TY $$
which we can "derive" by starting from $X\theta = Y$ and left-multiplying by $X^T$.
Clearly, then, the solution to this equation is just $$ \theta = (X^TX)^{-1}X^T Y $$
This is the kind of 30-second, baffling explanation that comes up in some ML classes, but let's talk about it in more detail, then discuss the relation to the gradient.
Note that the Gramian $G=X^TX\in\mathbb{R}^{n\times n}$ needs to be invertible for this to work at all, and even when it is, explicitly inverting it can be numerically unstable. The requirement for $G$ to be invertible is that the features (the columns of $X$) are linearly independent, which can usually be achieved by a simple feature-selection step.
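For instance, one quick way to check this condition in numpy (a sketch with made-up numbers):

```python
import numpy as np

X = np.array([[1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])  # third feature = 2 * second: linearly dependent
print(np.linalg.matrix_rank(X) == X.shape[1])  # False, so X^T X is singular here
```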
Still, it is often better to use the pseudo-inverse in any case. Notice that for a full-column-rank matrix $X$, the Moore-Penrose pseudoinverse is exactly $X^\dagger = (X^TX)^{-1}X^T$, meaning the "best" solution to $X\theta = Y$ is just
$ \theta = X^\dagger Y = (X^TX)^{-1}X^T Y $. I like this way of looking at it, since the fact that $\theta$ minimizes $J$ follows directly from the properties of the pseudo-inverse.
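As a quick numerical sanity check (random made-up data; assumes full column rank so the explicit inverse exists):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # m = 100 examples, n = 3 features
Y = rng.normal(size=100)

theta_pinv = np.linalg.pinv(X) @ Y               # pseudo-inverse solution
theta_ne = np.linalg.inv(X.T @ X) @ (X.T @ Y)    # normal-equation solution
assert np.allclose(theta_pinv, theta_ne)         # same minimizer of J
```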
Note also that we can derive the normal equations geometrically: $X\theta$ ranges over the column space of $X$, and the closest such point to $Y$ makes the residual $X\theta - Y$ orthogonal to every column of $X$, which is precisely $X^T(X\theta - Y) = 0$.
Now for the link to the zeroing of the gradient. Note that
\begin{align*}
\frac{\partial J}{\partial \theta_\ell}
&= \frac{1}{2m}\sum_i \frac{\partial }{\partial \theta_\ell} (\theta^Tx_i - y_i)^2 \\
&= \frac{1}{m}\sum_i (\theta^Tx_i - y_i) \frac{\partial }{\partial \theta_\ell} (\theta^Tx_i - y_i) \\
&= \frac{1}{m}\sum_i (\theta^Tx_i - y_i) x_{i\ell} \\
&= \frac{1}{m}\left[\sum_i
(\theta^Tx_i) x_{i\ell}
-
\sum_i y_i x_{i\ell}
\right] \\
\therefore\;\;
\nabla_\theta J
&= \frac{1}{m}\bigg[
\underbrace{ \sum_i (\theta^Tx_i) x_{i} }_{X^TX\theta}
-
\underbrace{ \sum_i y_i x_{i} }_{X^T Y}
\bigg]
\end{align*}
where we used the facts that
\begin{align*}
G_{\alpha\beta}
&= [X^TX]_{\alpha\beta}
= \sum_\eta [X^T]_{\alpha\eta} x_{\eta\beta}
= \sum_\eta x_{\eta\alpha} x_{\eta\beta} \\
[X^TX\theta]_{\alpha}
&= \sum_\xi [X^TX]_{\alpha\xi} \theta_\xi
= \sum_\xi \sum_\eta x_{\eta\alpha} x_{\eta\xi} \theta_\xi
= \sum_\eta x_{\eta\alpha} \underbrace{\sum_\xi x_{\eta\xi} \theta_\xi}_{\theta^T x_\eta}
= \sum_\eta (\theta^T x_\eta) x_{\eta\alpha} \\
[X^TY]_{\alpha}
&= \sum_\eta [X^T]_{\alpha\eta}\, y_\eta
= \sum_\eta y_\eta\, x_{\eta\alpha} \\
\therefore\;\; X^TX\theta &= \sum_i (\theta^T x_i) x_{i}
\quad\text{and}\quad
X^TY = \sum_i y_i x_{i}
\end{align*}
Great, so putting this together we get:
$$
\nabla_\theta J = 0 \;\;\;\;\implies\;\;\;\; \frac{1}{m}[ X^TX\theta - X^T Y ] = 0
$$
which exactly implies the normal equations:
$ X^TX\theta = X^T Y $. In other words, they follow exactly from setting the gradient to zero. So looking for a stationary point of $J$ via gradient descent is solving the same problem as the normal equations.
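To see this equivalence numerically, here is a minimal batch gradient-descent sketch (the data, step size `alpha`, and iteration count are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = X @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=50)

def grad_J(theta):
    m = X.shape[0]
    return (X.T @ (X @ theta - Y)) / m   # (1/m)(X^T X theta - X^T Y), as derived above

theta = np.zeros(2)
alpha = 0.1                              # arbitrary step size
for _ in range(5000):
    theta -= alpha * grad_J(theta)

# gradient descent lands on the normal-equation solution
assert np.allclose(theta, np.linalg.pinv(X) @ Y)
```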
So is there any reason to use gradient descent instead of just a one-liner like `theta = pinv(X) @ Y`?
Well, if $n$ is large, forming $G = X^TX$ costs $O(mn^2)$ and inverting it costs $O(n^3)$, which gets very expensive. There are numerical instabilities to worry about as well: forming $G$ explicitly squares the condition number of $X$, so QR- or SVD-based solvers are preferred in practice.
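In practice one would call a least-squares solver rather than forming the inverse; e.g. in numpy (with `X` and `Y` as above):

```python
import numpy as np

# np.linalg.lstsq minimizes ||X @ theta - Y||^2 via SVD,
# without ever forming or inverting X^T X
theta, residuals, rank, svals = np.linalg.lstsq(X, Y, rcond=None)
```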
Best Answer
The way to use the normal equations is just to solve the equation explicitly, since it has only one solution. The only reason they are not widely used instead of gradient descent is that they become very inefficient for large datasets. Look at the equation you wrote:
$$ \theta=(X^{T}X)^{-1}X^{T}Y $$
You have three matrix multiplications and one matrix inversion. The data matrix is usually large, since it is composed of the observed samples: $X$ can be huge, and naive multiplication of two $n \times n$ matrices has complexity $O(n^3)$.
However, if one wants to solve the linear system using the normal equation, one just has to do exactly what the equation says. If you are using numpy, for example, you would do something like this (a sketch; `X` and `Y` are assumed to already hold your data):
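```python
import numpy as np

# the pieces of (X^T X)^{-1} X^T Y, computed explicitly;
# X is (m, n) and Y is (m,)
G = X.T @ X                # Gramian X^T X, shape (n, n)
G_inv = np.linalg.inv(G)   # assumes the features are linearly independent
XtY = X.T @ Y              # X^T Y, shape (n,)
```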
Now just multiply them together:
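```python
theta = G_inv @ XtY        # theta = (X^T X)^{-1} X^T Y
```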
The normal equation can be derived easily by understanding the fact that a linear transformation $A$ respects
$$\operatorname{Ker}(A^T) \perp \operatorname{Im}(A),$$
that is, the image of $A$ is the orthogonal complement of the kernel of $A^T$. This means that if the linear system is not solvable (i.e., the vector $y$ is not in the image of $A$), then the best we can do is make the residual $y - A\theta$ perpendicular to the image of $A$, since the shortest vector from a point to a subspace is perpendicular to that subspace.
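Concretely, "perpendicular to $\operatorname{Im}(A)$" means the residual lies in $\operatorname{Ker}(A^T)$:
$$ A^T(y - A\theta) = 0 \;\iff\; A^TA\theta = A^Ty, $$
which is exactly the normal equation.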
Edit: I just realized that you probably want an example with some static dummy data. To apply the example above to sample data, you only need to know how to create a matrix with numpy, which is easy to find online. It would look something like this (the numbers are made up, purely for illustration):
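```python
import numpy as np

# made-up dataset: m = 4 examples, n = 2 features
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])            # column of ones acts as a bias feature
Y = np.array([3.0, 5.0, 7.0, 9.0])    # Y = 2 * x - 1, exactly linear

G = X.T @ X
G_inv = np.linalg.inv(G)
XtY = X.T @ Y
theta = G_inv @ XtY
print(theta)                          # -> [-1.  2.]
```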