[Math] Roles of $\bf A^TA$ ($\text {A transpose A}$) matrices in orthogonal projection

linear-algebra projection-matrices

Matrices of the form $\bf A^TA$ (or, equivalently (?), positive semidefinite matrices, and more particularly covariance matrices $\bf \Sigma$) are linked in practice to many operations in which data points are orthogonally projected:

  1. In ordinary least squares (OLS) regression, it is part of the projection matrix $\bf P = X(\color{blue}{X^TX})^{-1}X^T$ of the "dependent variable" onto the column space of the model matrix (see the numerical sketch after this list).

  2. In principal component analysis (PCA) the data are projected onto the eigenvectors of the covariance matrix.

  3. In Gaussian processes, the covariance matrix transforms "white" random samples into correlated samples, which intuitively seems to correspond to a way of projecting.
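
To make the first two items concrete, here is a minimal NumPy sketch on made-up random data (all variable names are mine, nothing here is taken from the linked posts): it builds the OLS hat matrix and checks that it is a symmetric idempotent projector, then projects a centered data cloud onto the eigenvectors of its covariance matrix and checks that the variances of the projected coordinates are the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. OLS: the hat matrix P = X (X^T X)^{-1} X^T projects y onto col(X)
X = rng.normal(size=(50, 3))          # made-up model matrix
y = rng.normal(size=50)               # made-up response
P = X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(P @ P, P)          # idempotent: P is a projection
assert np.allclose(P, P.T)            # symmetric: an *orthogonal* projection
y_hat = P @ y                         # fitted values = projection of y

# 2. PCA: project centered data onto the eigenvectors of the covariance matrix
Z = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated data cloud
Zc = Z - Z.mean(axis=0)
Sigma = Zc.T @ Zc / (len(Zc) - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma)
scores = Zc @ eigvecs                 # coordinates in the eigenvector basis
# the variance captured along each eigenvector is the corresponding eigenvalue
assert np.allclose(scores.var(axis=0, ddof=1), eigvals)
```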

But I am looking for a unifying explanation, a more general concept.

In this regard, I have come across the sentence, "It is as if the covariance matrix stored all possible projection variances in all directions," a statement seemingly supported by the fact that, for a data cloud in $\mathbb R^n$, the variance of the projection of the points onto a unit vector $\bf u$ is given by $\bf u^T \Sigma u$.
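
That fact is easy to verify numerically; a small sketch with hypothetical data (same NumPy conventions as above):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # made-up correlated cloud
Zc = Z - Z.mean(axis=0)                                   # center the data
Sigma = Zc.T @ Zc / (len(Zc) - 1)                         # sample covariance

u = rng.normal(size=4)
u /= np.linalg.norm(u)                                    # arbitrary unit direction

proj = Zc @ u                                             # scalar projections onto u
assert np.allclose(proj.var(ddof=1), u @ Sigma @ u)       # variance along u = u^T Sigma u
```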

So is there a way to unify all these inter-related properties into a single set of principles from which all the applications and geometric derivations can be seen?

I believe that the unifying theme is related to the orthogonal diagonalization $\bf A^T A = U^T D U$ as mentioned here, but I'd like to see this idea explained a bit further.
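
Here is what I have been able to check numerically so far about that diagonalization (a sketch, with $\lambda_i$ and $\bf Q$ being my own names for the eigenvalues and eigenvectors of $\bf A^\top A$): writing $\bf A^\top A = Q \Lambda Q^\top$, the quadratic form $\bf u^\top A^\top A\, u$ decomposes as $\sum_i \lambda_i (q_i^\top u)^2$, i.e. the eigenpairs seem to "store" the projection variance available in every direction $\bf u$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))
Sigma = A.T @ A                        # any matrix of the A^T A form

lam, Q = np.linalg.eigh(Sigma)         # orthogonal diagonalization: Sigma = Q diag(lam) Q^T
u = rng.normal(size=4)
u /= np.linalg.norm(u)

# u^T Sigma u = sum_i lam_i * (q_i . u)^2:
# the eigenpairs "store" the projection variance available in every direction u
assert np.allclose(u @ Sigma @ u, np.sum(lam * (Q.T @ u) ** 2))
```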


EXEGETICAL APPENDIX for novices:

It was far from self-evident, but after some help from Michael Hardy and @stewbasic, the answer by Étienne Bézout may be starting to click. So, like in the movie Memento, I'd better tattoo what I've got so far here in case it is blurry in the morning:

Concept One:

Block matrix multiplication:

\begin{align}
A^\top A & = \begin{bmatrix} \vdots & \vdots & \vdots & \cdots & \vdots \\
a_1^\top & a_2^\top & a_3^\top & \cdots & a_{m}^\top\\
\vdots & \vdots & \vdots & \cdots & \vdots\end{bmatrix}
\begin{bmatrix}
\cdots & a_1 & \cdots\\
\cdots & a_2 & \cdots \\
\cdots & a_3 & \cdots \\
& \vdots&\\
\cdots & a_{m} & \cdots
\end{bmatrix}\\
&= a_1^\top a_1 + a_2^\top a_2 + a_3^\top a_3 + \cdots+a_m^\top a_m\tag{1}
\end{align}

where the $a_i$'s are the $[\color{blue}{1 \times \bf n}]$ rows of $\bf A$ (so $\bf A$ itself is an $[m \times \color{blue}{\bf n}]$ matrix).
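
Eq. 1 is easy to check numerically; a tiny sketch (hypothetical $5 \times 3$ matrix, NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 3))               # m = 5 rows, n = 3 columns

# Eq. 1: A^T A equals the sum of the n x n outer products a_i^T a_i of the rows
outer_sum = sum(np.outer(row, row) for row in A)
assert np.allclose(A.T @ A, outer_sum)
print(outer_sum.shape)                    # (3, 3), i.e. n x n
```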


Concept Two:

The $\color{blue}{\bf n}$.

We have the same dimensions for the block matrix product $\bf \underset{[\color{blue}{\bf n} \times \text{many rows}]}{\bf A^\top}\underset{[\text{many rows} \times \color{blue}{\bf n}]}{\bf A} =\large [\color{blue}{\bf n} \times \color{blue}{\bf n}] \small \text{ matrix}$ as for each individual summand $\bf a_i^\top a_i$ in Eq. 1.


Concept Three:

$\bf a_i^\top a_i$ is deceptive because of a key definition: $\bf a_i$ is a row vector.

Because $\bf a_i$ was defined as a row vector, and the $\bf a_i$ vectors are normalized ($\vert a_i \vert =1$), $\bf a_i^\top a_i$ is really a matrix of the form $\bf XX^\top$, which is a projection matrix provided the $a_i$ vectors are linearly independent (check: "…are linearly independent") and orthonormal (not a requisite in the answer: "I'm no longer saying they are orthogonal") – $\color{red}{\text{Do these vectors actually need to be orthonormal?}}$ Or can this orthonormality constraint on the $a_i$ be relaxed, or is it implicitly fulfilled by virtue of other considerations? Otherwise we have a rather specific $\bf A$ matrix, making the results less generalizable.
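
The red question can at least be probed numerically (a sketch, not a proof, with matrices I made up): summing the rank-one terms $a_i^\top a_i$ gives an idempotent matrix when the rows are orthonormal, but generally not when they are merely unit-norm.

```python
import numpy as np

rng = np.random.default_rng(4)

# Orthonormal rows: take two columns of a random orthogonal matrix, transposed
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A_ortho = Q[:, :2].T                  # 2 orthonormal rows in R^4
P = A_ortho.T @ A_ortho               # sum of the a_i^T a_i terms
print(np.allclose(P @ P, P))          # True: a projection

# Unit-norm but non-orthogonal rows
A_unit = rng.normal(size=(2, 4))
A_unit /= np.linalg.norm(A_unit, axis=1, keepdims=True)
P2 = A_unit.T @ A_unit
print(np.allclose(P2 @ P2, P2))       # generally False: not a projection
```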


Concept Four:

A projection onto what?

Onto the column space of $\bf X$ (think of the OLS projection ${\bf X}\color{gray}{(X^\top X)^{-1}} {\bf X^\top}$). But what is $\bf X$ here? None other than $\bf a_i^\top$, and since $\bf a_i$ is a row vector, $\bf a_i^\top$ is a column vector.

So we are doing ortho-projections onto the column space of $\bf A^\top$, which is a subspace of $\mathbb R^{\color{blue}{\bf n}}$.

I was hoping that the last sentence could have been, "… onto the column space of $\bf A$ …"
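
As a concrete check of Concept Four (a sketch assuming a hypothetical $\bf A$ with two linearly independent rows): plugging $\bf X = A^\top$ into the OLS formula yields the orthogonal projector onto the column space of $\bf A^\top$, a subspace of $\mathbb R^4$ in this toy example.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(2, 4))             # two (generically independent) rows in R^4

# OLS projector with X = A^T: projects onto col(A^T), i.e. the row space of A
P = A.T @ np.linalg.inv(A @ A.T) @ A

x = rng.normal(size=4)
residual = x - P @ x
assert np.allclose(P @ P, P)            # idempotent
assert np.allclose(A @ residual, 0)     # residual is orthogonal to every row of A
```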


What are the implications?

Best Answer

Suppose we are given a matrix $\mathrm A$ that has full column rank. Its SVD is of the form

$$\mathrm A = \mathrm U \Sigma \mathrm V^T = \begin{bmatrix} \mathrm U_1 & \mathrm U_2\end{bmatrix} \begin{bmatrix} \hat\Sigma\\ \mathrm O\end{bmatrix} \mathrm V^T$$

where the zero matrix may be empty. Note that

$$\mathrm A \mathrm A^T = \mathrm U \Sigma \mathrm V^T \mathrm V \Sigma^T \mathrm U^T = \mathrm U \begin{bmatrix} \hat\Sigma^2 & \mathrm O\\ \mathrm O & \mathrm O\end{bmatrix} \mathrm U^T$$

can only be a projection matrix if $\hat\Sigma = \mathrm I$. However,

$$\begin{array}{rl} \mathrm A (\mathrm A^T \mathrm A)^{-1} \mathrm A^T &= \mathrm U \Sigma \mathrm V^T (\mathrm V \Sigma^T \mathrm U^T \mathrm U \Sigma \mathrm V^T)^{-1} \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \mathrm V^T (\mathrm V \Sigma^T \mathrm \Sigma \mathrm V^T)^{-1} \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \mathrm V^T (\mathrm V \hat\Sigma^2 \mathrm V^T)^{-1} \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \mathrm V^T \mathrm V \hat\Sigma^{-2} \mathrm V^T \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \hat\Sigma^{-2} \Sigma^T \mathrm U^T\\ &= \mathrm U \begin{bmatrix} \mathrm I & \mathrm O\\ \mathrm O & \mathrm O\end{bmatrix} \mathrm U^T = \mathrm U_1 \mathrm U_1^T\end{array}$$

is always a projection matrix.
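
Both claims are easy to sanity-check numerically; a sketch with a random (hence, almost surely full-column-rank) $\mathrm A$:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(6, 3))           # random, hence almost surely full column rank

U, s, Vt = np.linalg.svd(A)           # full SVD: A = U [diag(s); 0] V^T
U1 = U[:, :3]                         # left singular vectors spanning col(A)

hat = A @ np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(hat, U1 @ U1.T)    # A (A^T A)^{-1} A^T = U_1 U_1^T
assert np.allclose(hat @ hat, hat)    # hence always a projection

AAt = A @ A.T                         # a projection only if all singular values equal 1
print(np.allclose(AAt @ AAt, AAt))    # generally False
```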