[Math] Least squares approximation (matrices)

linear-algebra, matrices

I’m quite stuck on the question below. It keeps coming up in my exam and I don’t know how to do it! Please help me! Thanks!

Show that the matrix $P = A (A^tA)^{-1} A^t$ represents an orthogonal projection onto $\mathcal{R}(A)$. Hence or otherwise explain why $x^\star = (A^t A)^{-1} A^t b$ represents the least squares solution of the matrix equation $Ax=b$.

Best Answer

Well, okay, assume $A$ is overdetermined ($m \times n$ with $m > n$). The column vectors of $A$ span $\mathcal R(A)$; if $A$ has full column rank, they form a basis of $\mathcal R(A)$.

The matrix $A^T A$ is symmetric (since $(A^TA)^T = A^T A$), and it is invertible if and only if $A$ has full column rank.
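A quick numerical sanity check of these two facts (a sketch using NumPy; the matrix $A$ below is a made-up full-rank example, not from the question):

```python
import numpy as np

# Hypothetical 4x2 matrix with m > n and full column rank.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

G = A.T @ A                          # the Gram matrix A^T A

print(np.allclose(G, G.T))           # symmetric
print(np.linalg.matrix_rank(A))      # full column rank: 2
print(abs(np.linalg.det(G)) > 1e-9)  # hence A^T A is invertible
```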

Now, naming the column vectors of $A$ $a_1, a_2, \dots, a_n$ and writing $x = (x_1, x_2, \dots, x_n)^T$, we can express $Ax$ as $$Ax = x_1 a_1 + x_2 a_2 + \dots + x_n a_n = \sum_{i=1}^n a_i x_i$$ and we want this to be equal to $b$. Assuming there is no such solution, we instead want to minimize $$\| b - Ax \| = \left\| b - \sum_{i=1}^n a_i x_i \right\|$$ To minimize this, we want to find the projection of $b$ onto $\mathcal R(A)$. This projection can be expressed as $\sum_{i=1}^n a_i x_i$, since the $a_i$ span $\mathcal R(A)$. Why does this minimize the expression above? Simply because the projection is the closest point to $b$ in $\mathcal R(A)$ — we cannot reach anything outside $\mathcal R(A)$ with combinations of elements of $\mathcal R(A)$.
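The question's matrix $P = A(A^TA)^{-1}A^T$ is exactly this projector onto $\mathcal R(A)$. One can check the defining properties numerically (a sketch using NumPy; $A$ and $b$ are made-up examples for illustration):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

# P = A (A^T A)^{-1} A^T, the orthogonal projector onto R(A)
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))   # idempotent: projecting twice changes nothing
print(np.allclose(P, P.T))     # symmetric: the projection is orthogonal
print(np.allclose(P @ A, A))   # fixes every vector already in R(A)

b = np.array([0.0, 1.0, 1.0, 3.0])
r = b - P @ b                      # residual y = b - Pb
print(np.allclose(A.T @ r, 0.0))   # orthogonal to every column a_i
```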

Now, assume $x_1, x_2, \dots, x_n$ are the coefficients of $a_1, a_2, \dots, a_n$ such that $\sum a_i x_i$ is the orthogonal projection of $b$ onto $\mathcal R(A)$. Then $b$ can be written as $$b = Ax + y$$ for some $y$ which is orthogonal to $\mathcal R(A)$, i.e. $y = b - Ax$. Since $y$ is orthogonal to $\mathcal R(A)$, it is orthogonal to every $a_i$: $$ \left\langle a_i , \underbrace{b - Ax}_{=y} \right\rangle = 0 \Longleftrightarrow a_i^T(b-Ax)=0 ~~~~ \forall i = 1, \dots, n $$ Taking these equations for all $i$ at once, this can be written in matrix form as $A^T(b-Ax)=0$, which rearranges to $A^Tb = A^TAx$, or, if $A^TA$ is invertible, $x = (A^TA)^{-1}A^Tb$.
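As a check, the normal-equations solution $(A^TA)^{-1}A^Tb$ agrees with a library least-squares solver (a sketch using NumPy; $A$ and $b$ are the same made-up examples as above):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([0.0, 1.0, 1.0, 3.0])

# Normal-equations formula x = (A^T A)^{-1} A^T b
x_normal = np.linalg.inv(A.T @ A) @ A.T @ b

# Library least-squares solution for comparison
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))          # same minimizer
print(np.allclose(A.T @ A @ x_normal, A.T @ b))  # satisfies A^T A x = A^T b
```

(In practice one solves $A^TAx = A^Tb$ with `np.linalg.solve` or uses `lstsq` directly rather than forming the explicit inverse, for numerical stability.)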

So, basically, to derive $x = (A^TA)^{-1}A^T b$: require that $b-Ax$ be orthogonal to each column vector $a_i$ in turn, then stack these $n$ equations "on top of each other" into the single matrix equation $A^T(b-Ax)=0$.