Calculus – Why Minimizing Least Squares Finds Projection Matrix

calculus, least squares, linear algebra, regression, vector-spaces

I understand the derivation for $\hat{x}=A^Tb(A^TA)^{-1}$, but I'm having trouble explicitly connecting it to least squares regression.

So suppose we have a system of equations: $A=\begin{bmatrix}1 & 1\\1 & 2\\1 &3\end{bmatrix}, x=\begin{bmatrix}C\\D\end{bmatrix}, b=\begin{bmatrix}1\\2\\2\end{bmatrix}$

Using $\hat{x}=A^Tb(A^TA)^{-1}$, we know that $D=\frac{1}{2}, C=\frac{2}{3}$. But this is also equivalent to minimizing the sum of squares: $e^2_1+e^2_2+e^2_3 = (C+D-1)^2+(C+2D-2)^2+(C+3D-2)^2$.
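To make the link to calculus explicit, here is a worked sketch for this particular example. The name $E$ for the sum of squares is introduced here only for convenience. Setting both partial derivatives of $E$ to zero gives:

$$
\begin{aligned}
E(C,D) &= (C+D-1)^2+(C+2D-2)^2+(C+3D-2)^2,\\
\frac{\partial E}{\partial C} &= 2\bigl[(C+D-1)+(C+2D-2)+(C+3D-2)\bigr] = 2(3C+6D-5) = 0,\\
\frac{\partial E}{\partial D} &= 2\bigl[(C+D-1)+2(C+2D-2)+3(C+3D-2)\bigr] = 2(6C+14D-11) = 0.
\end{aligned}
$$

These two equations read $\begin{bmatrix}3 & 6\\6 & 14\end{bmatrix}\begin{bmatrix}C\\D\end{bmatrix}=\begin{bmatrix}5\\11\end{bmatrix}$, which is exactly $A^TAx=A^Tb$ for the $A$ and $b$ above, and solving them gives $C=\frac{2}{3}$, $D=\frac{1}{2}$.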

I know the linear algebra approach is finding a hyperplane that minimizes the distance between points and the plane, but I'm having trouble understanding why it minimizes the squared distance. My intuition tells me it should minimize absolute distance, but I know this is wrong because it's possible for there to be non-unique solutions.

Why is this so? Any help would be greatly appreciated. Thanks!

Best Answer

You should be multiplying by $(A^T A)^{-1}$ on the left, not the right: the formula is $\hat{x}=(A^TA)^{-1}A^Tb$. Anyway, the geometric point is that you want $Ax-b$ to be perpendicular to $Ay$ for every vector $y$. (I think this is most easily seen by a geometric argument, which can be easily found in books, but which I can't easily render here.) This translates to $(Ay)^T(Ax-b)=0$ for every $y$, which is the same as $y^T(A^T(Ax-b))=0$ for every $y$. This can only happen if $A^T(Ax-b)=0$ (take $y=A^T(Ax-b)$ to see that its squared length is zero), which rearranges to $x=(A^TA)^{-1}A^Tb$ if $A^T A$ is invertible (which holds exactly when the columns of $A$ are linearly independent, as is usually the case).
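The perpendicularity condition can be checked numerically on the example from the question. This is a minimal NumPy sketch, not a definitive implementation: the variable names and the choice of five random test vectors $y$ are assumptions made for illustration.

```python
import numpy as np

# Data from the question: fit b ~ C + D*t at t = 1, 2, 3.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Solve the normal equations A^T A x = A^T b.
# Solving the linear system applies (A^T A)^{-1} on the left
# without forming the inverse explicitly.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Residual A x_hat - b of the fitted solution.
residual = A @ x_hat - b

# A^T (A x_hat - b) should vanish up to rounding error.
normal_eq_defect = A.T @ residual

# Check perpendicularity to A y for a few arbitrary vectors y
# (the seed and the number of vectors are illustrative choices).
rng = np.random.default_rng(0)
ys = rng.standard_normal((5, 2))
dots = (ys @ A.T) @ residual  # entry i is (A y_i)^T (A x_hat - b)

print("x_hat =", x_hat)
print("A^T (A x_hat - b) =", normal_eq_defect)
print("(A y)^T (A x_hat - b) =", dots)
```

Up to floating-point error, `x_hat` matches $C=\frac{2}{3}$, $D=\frac{1}{2}$, and both `normal_eq_defect` and every entry of `dots` are zero.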

Also, minimizing the square of the Euclidean distance is the same as minimizing the Euclidean distance itself, because squaring is an increasing function on nonnegative numbers. (The same holds for any other nonnegative quantity.) What would be different is minimizing some other distance, like the "taxicab" distance (where you sum the absolute values). Why we should choose to minimize the Euclidean distance in the first place is not a purely mathematical question; it depends on where the problem is coming from. That question is a bit off-topic here, though, and has also been asked before on MSE. (The short version of that discussion: "it's mathematically convenient" and "see the Gauss-Markov theorem".)
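To see the difference between the Euclidean and taxicab criteria on the question's data, here is a small NumPy sketch. The helper names are hypothetical. The taxicab minimizer $C=D=\frac{1}{2}$ (the line through the first and third data points) is an assumption worked out by hand for this specific example, not something computed by the code.

```python
import numpy as np

# Data from the question: fit b ~ C + D*t at t = 1, 2, 3.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

def squared_cost(x):
    """Sum of squared residuals, e_1^2 + e_2^2 + e_3^2."""
    r = A @ x - b
    return float(r @ r)

def euclidean_cost(x):
    """Euclidean length of the residual (square root of squared_cost)."""
    return float(np.linalg.norm(A @ x - b))

def taxicab_cost(x):
    """Sum of absolute residuals, |e_1| + |e_2| + |e_3|."""
    return float(np.abs(A @ x - b).sum())

# Least-squares solution from NumPy's built-in solver.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Hand-derived taxicab minimizer for this data (an assumption, see lead-in).
x_l1 = np.array([0.5, 0.5])

for name, x in [("least squares", x_ls), ("taxicab", x_l1)]:
    print(name, x, squared_cost(x), euclidean_cost(x), taxicab_cost(x))
```

The least-squares point has the smaller squared cost ($\frac{1}{6}$ versus $\frac{1}{4}$) and therefore also the smaller Euclidean cost, while the taxicab point has the smaller sum of absolute values ($\frac{1}{2}$ versus $\frac{2}{3}$). The squared and unsquared Euclidean criteria always rank candidates the same way; the taxicab criterion can pick a different line.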
