Normal equation as an alternative to gradient descent for finding $\theta$

linear algebra, machine learning

I am doing a machine learning course in which we were first taught the gradient descent algorithm for minimizing the cost function $J\left( \theta \right)$. At that point I wondered why we wouldn't just find the zeros of the partial derivatives of $J\left( \theta \right)$ to achieve this, but then the normal equation, $\theta=\left( X^{T}X\right) ^{-1}X^{T}y$, was introduced, which supposedly has something to do with it, or at least that was the idea I got from the lecture. I have some knowledge of linear algebra, though not extensive, and I have been trying to understand on my own how this relates to finding the zeros of the partial derivatives, but so far I have failed. Can someone give some intuition on how this works, or at least point me in the right direction?

Edit to add some context:

$$J\left( \theta \right) = \dfrac {1}{2m}\sum ^{m}_{i=1}\left( h_{\theta }\left( x^{i}\right) -y^{i}\right) ^{2} $$

is the cost function for the hypothesis $h_{\theta }\left( x\right) =
\begin{bmatrix} \theta _{0} & \theta _{1} & \ldots & \theta _{n} \end{bmatrix}\begin{bmatrix} x_{0} \\ x_{1} \\ \vdots \\ x_{n} \end{bmatrix} =
\theta ^{T}x$
for a training set of $m$ rows (examples) and $n$ features. The vector of target values of the training set is $y$.

Best Answer

At that point I was wondering why wouldn't we just find the zeros of the partial derivatives of $J(θ)$

Normally you cannot, because the resulting equations are too complicated to solve analytically. For example, in modern neural networks $J$ is extremely non-linear, non-convex, costly to evaluate, and complicated (even involving control flow these days), and $\theta$ has tens or hundreds of millions of dimensions.

but then it was introduced the normal equation, $θ=(X^TX)^{−1}X^Ty$ that supposedly has something to do with it

Yes, linear regression is a special case where you can analytically solve for the zeros of the partial derivatives of $J$.


Let $X\in\mathbb{R}^{m\times n}$ (with $x_i\in\mathbb{R}^n$ as the $i$th example, i.e. the $i$th row of $X$), $Y\in\mathbb{R}^m$, $\theta\in\mathbb{R}^n$, and $H=X\theta\in\mathbb{R}^m$ (so that $h_i=h_\theta(x_i)=\theta^Tx_i$). Then we may write $$ J(\theta) = \frac{1}{2m}\|H - Y\|_2^2 = \frac{1}{2m}\|X\theta - Y\|_2^2 = \frac{1}{2m}\sum_i (h_\theta(x_i) - y_i)^2 = \frac{1}{2m}\sum_i (\theta^Tx_i - y_i)^2 $$ where $n$ is the number of features and $m$ is the number of examples. Notice that in the ideal case we would have $X\theta = Y$, and we could just solve this as a linear system. But $X$ is usually not square, of course.
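(To make the two forms of $J$ above concrete, here is a minimal NumPy sketch on made-up random data, checking that the vectorized and summation forms agree; the data and variable names are purely illustrative.)

```python
import numpy as np

# Toy data: m = 5 examples (rows of X), n = 3 features.
rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n))        # each row is one example x_i
Y = rng.normal(size=m)             # targets
theta = rng.normal(size=n)         # parameters

# Vectorized form: J(theta) = (1/2m) * ||X theta - Y||^2
J_vec = np.sum((X @ theta - Y) ** 2) / (2 * m)

# Summation form: J(theta) = (1/2m) * sum_i (theta^T x_i - y_i)^2
J_sum = sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, Y)) / (2 * m)

print(np.isclose(J_vec, J_sum))    # True: both forms agree
```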

The normal equations are usually written $$ X^TX\theta = X^TY $$ which we can "derive" by starting from $X\theta = Y$ and left-multiplying by $X^T$. Clearly, then, the solution to this equation is just $$ \theta = (X^TX)^{-1}X^T Y $$ This is the kind of 30-second, baffling explanation that comes up in some ML classes, but let's talk about it in more detail, then discuss the relation to the gradient.
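As a quick numerical illustration (NumPy sketch on made-up data), the normal equations can be solved in a couple of lines; using np.linalg.solve rather than forming the inverse explicitly is the usual recommendation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
Y = rng.normal(size=m)

# Solve X^T X theta = X^T Y directly, without building (X^T X)^{-1}.
theta = np.linalg.solve(X.T @ X, X.T @ Y)

# The normal equations say the residual X theta - Y is orthogonal
# to every column of X; verify that numerically.
print(np.allclose(X.T @ (X @ theta - Y), 0))   # True
```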

Note that the Gramian $G=X^TX\in\mathbb{R}^{n\times n}$ needs to be invertible for this to work at all, and inverting it may be numerically unstable. The requirement for $G$ to be invertible is that the features (columns of $X$) are linearly independent (see here or here), which can usually be achieved by some simple feature selection process. Even so, it is often better to use the pseudo-inverse. Notice that when $X$ has full column rank, the Moore-Penrose pseudoinverse is exactly $ X^\dagger = (X^TX)^{-1}X^T$, meaning the "best" solution to $X\theta = Y$ is just $ \theta = X^\dagger Y = (X^TX)^{-1}X^T Y $. I like this way of looking at it, since the fact that $\theta$ minimizes $J$ follows from properties of the pseudo-inverse.
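Here is a small sanity check of that identity (NumPy sketch with random full-rank data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
Y = rng.normal(size=m)

# When X has full column rank, pinv(X) equals (X^T X)^{-1} X^T,
# so both routes give the same theta.
theta_pinv   = np.linalg.pinv(X) @ Y
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ Y
print(np.allclose(theta_pinv, theta_normal))   # True for full-rank X
```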

Note also that we can derive the normal equations geometrically. See here or here.

Now for the link to the zeroing of the gradient. Note that \begin{align*} \frac{\partial J}{\partial \theta_\ell} &= \frac{1}{2m}\sum_i \frac{\partial }{\partial \theta_\ell} (\theta^Tx_i - y_i)^2 \\ &= \frac{1}{m}\sum_i (\theta^Tx_i - y_i) \frac{\partial }{\partial \theta_\ell} (\theta^Tx_i - y_i) \\ &= \frac{1}{m}\sum_i (\theta^Tx_i - y_i) x_{i\ell} \\ &= \frac{1}{m}\left[\sum_i (\theta^Tx_i) x_{i\ell} - \sum_i y_i x_{i\ell} \right] \\ \therefore\;\; \nabla_\theta J &= \frac{1}{m}\bigg[ \underbrace{ \sum_i (\theta^Tx_i) x_{i} }_{X^TX\theta} - \underbrace{ \sum_i y_i x_{i} }_{X^T Y} \bigg] \end{align*} (the factor of $2$ from the chain rule cancels the $\frac{1}{2}$), where we used the facts that \begin{align*} G_{\alpha\beta} &= [X^TX]_{\alpha\beta} = \sum_\eta [X^T]_{\alpha\eta} x_{\eta\beta} = \sum_\eta x_{\eta\alpha} x_{\eta\beta} \\ [X^TX\theta]_{\alpha} &= \sum_\xi [X^TX]_{\alpha\xi} \theta_\xi = \sum_\xi \sum_\eta x_{\eta\alpha} x_{\eta\xi} \theta_\xi = \sum_\eta x_{\eta\alpha} \underbrace{\sum_\xi x_{\eta\xi} \theta_\xi}_{\theta^T x_\eta} = \sum_\eta (\theta^T x_\eta) x_{\eta\alpha} \\ \therefore\;\; X^TX\theta &= \sum_i (\theta^T x_i) x_{i} \end{align*} Great, so putting this together we get: $$ \nabla_\theta J = 0 \;\;\;\;\implies\;\;\;\; \frac{1}{m}[ X^TX\theta - X^T Y ] = 0 $$ which exactly gives the normal equations: $ X^TX\theta = X^T Y $. In other words, they follow exactly from setting the gradient to zero. So looking for a stationary point of $J$ via gradient descent amounts to solving the normal equations.
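To see this numerically, here is a toy sketch (NumPy, random data, step size hand-tuned for this particular problem) showing that plain gradient descent on $J$ lands on the same $\theta$ as the normal-equation solution:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
Y = rng.normal(size=m)

# Closed-form solution from the normal equations.
theta_star = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent on J, using grad J = (1/m) (X^T X theta - X^T Y).
theta = np.zeros(n)
lr = 0.05                      # step size chosen by hand for this toy problem
for _ in range(5000):
    grad = (X.T @ X @ theta - X.T @ Y) / m
    theta -= lr * grad

print(np.allclose(theta, theta_star, atol=1e-6))   # both land on the same point
```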

So is there any reason to use gradient descent instead of just one line of code like theta = pinv(X) @ Y? Well, if $n$ is large, forming and inverting $G$ will be very expensive. There are numerical instabilities to worry about as well. See here for instance.
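For what it's worth, library least-squares routines typically avoid forming $X^TX$ at all and instead factorize $X$ itself (QR or SVD); in NumPy that is np.linalg.lstsq, as in this sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10
X = rng.normal(size=(m, n))
Y = rng.normal(size=m)

# lstsq solves min_theta ||X theta - Y||^2 via an SVD-based routine,
# avoiding the explicit (and potentially ill-conditioned) product X^T X.
theta, residuals, rank, sing_vals = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(X.T @ (X @ theta - Y), 0))   # still satisfies the normal equations
```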

