$\bf A^TA$ matrices (equivalently, positive semidefinite matrices, and in particular covariance matrices $\bf \Sigma$) are linked in practice to many operations in which data points are orthogonally projected:
- In ordinary least squares (OLS) regression, $\bf \color{blue}{X^TX}$ appears in the projection matrix $\bf P = X(\color{blue}{X^TX})^{-1}X^T$, which projects the "dependent variable" onto the column space of the model matrix (a small numerical sketch follows this list).
- In principal component analysis (PCA), the data are projected onto the eigenvectors of the covariance matrix.
- In Gaussian processes, the covariance matrix transforms "white" (uncorrelated) random samples into correlated ones, which intuitively seems to correspond to a form of projection.
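For concreteness, here is a rough NumPy sketch of the first two bullets (a toy example of my own, not from any of the referenced posts; all variable names are arbitrary): the hat matrix built from $\bf X^TX$ is symmetric and idempotent, and projecting centered data onto the eigenvectors of the covariance matrix decorrelates it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # model matrix / data cloud: 100 points in R^3

# OLS: P = X (X^T X)^{-1} X^T is the orthogonal projector onto col(X)
P = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(P, P.T), np.allclose(P @ P, P))   # True True -> a projection matrix

# PCA: project the centered data onto the eigenvectors of the covariance matrix
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / (len(Xc) - 1)        # sample covariance, an A^T A-type matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)
scores = Xc @ eigvecs                    # coordinates of each point in the eigenbasis
print(np.allclose(np.cov(scores, rowvar=False), np.diag(eigvals)))  # True: decorrelated
```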
But I am looking for a unifying explanation, a more generic concept.
In this regard, I have come across the sentence, "It is as if the covariance matrix stored all possible projection variances in all directions," a statement seemingly supported by the fact that for a data cloud in $\mathbb R^n$, the variance of the projection of the points onto a unit vector $\bf u$ is given by $\bf u^T \Sigma u$.
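That supporting fact is quick to verify. Assuming the $m$ data points $x_i \in \mathbb R^n$ are centered and writing $\Sigma = \frac{1}{m}\sum_{i=1}^m x_i x_i^\top$ (notation of my own; the $\frac{1}{m-1}$ convention changes nothing in the argument):
$$\operatorname{Var}(\mathbf u^\top x) = \frac{1}{m}\sum_{i=1}^m (\mathbf u^\top x_i)^2 = \frac{1}{m}\sum_{i=1}^m \mathbf u^\top x_i x_i^\top \mathbf u = \mathbf u^\top \left(\frac{1}{m}\sum_{i=1}^m x_i x_i^\top\right) \mathbf u = \mathbf u^\top \Sigma\, \mathbf u.$$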
So is there a way of unifying all these inter-related properties into a single set of principles from which all the applications and geometric derivations can be seen?
I believe that the unifying theme is related to the orthogonal diagonalization $\bf A^T A = U^T D U$ as mentioned here, but I'd like to see this idea explained a bit further.
EXEGETICAL APPENDIX for novices:
It was far from self-evident, but after some help from Michael Hardy and @stewbasic, the answer by Étienne Bézout may be starting to click. So, like in the movie Memento, I'd better tattoo what I have got so far here in case it is blurry in the morning:
Concept One:
\begin{align}
A^\top A & = \begin{bmatrix} \vdots & \vdots & \vdots & \cdots & \vdots \\
a_1^\top & a_2^\top & a_3^\top & \cdots & a_{\color{blue}{\bf n}}^\top\\
\vdots & \vdots & \vdots & \cdots & \vdots\end{bmatrix}
\begin{bmatrix}
\cdots & a_1 & \cdots\\
\cdots & a_2 & \cdots \\
\cdots & a_3 & \cdots \\
& \vdots&\\
\cdots & a_{\color{blue}{\bf n}} & \cdots
\end{bmatrix}\\
&= a_1^\top a_1 + a_2^\top a_2 + a_3^\top a_3 + \cdots+a_n^\top a_n\tag{1}
\end{align}
where the $a_i$'s are the $[\color{blue}{1 \times \bf n}]$ row vectors of $\bf A$.
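A quick NumPy sanity check of Eq. (1), on a toy matrix of my own with 5 rows in $\mathbb R^3$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))              # rows a_i are 1 x 3 row vectors

# sum of the rank-one matrices a_i^T a_i (outer products of the rows of A)
outer_sum = sum(np.outer(a, a) for a in A)
print(np.allclose(A.T @ A, outer_sum))   # True: A^T A is the sum of its rows' outer products
```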
Concept Two:
The dimension $\color{blue}{\bf n}$.
The block matrix multiplication $\bf \underset{[\color{blue}{\bf n} \times \text{many rows}]}{\bf A^\top}\underset{[\text{many rows} \times \color{blue}{\bf n}]}{\bf A} =\large [\color{blue}{\bf n} \times \color{blue}{\bf n}] \small \text{ matrix}$ has the same dimensions as each individual summand $\bf a_i^\top a_i$ in Eq. (1).
Concept Three:
The expression $\bf a_i^\top a_i$ is deceptive because of a key definition: $\bf a_i$ is a row vector.
Because $\bf a_i$ was defined as a row vector, and the $\bf a_i$ vectors are normalized ($\vert a_i \vert =1$), $\bf a_i^\top a_i$ is really a matrix of the form $\bf XX^\top$, which is a projection matrix provided the columns of $\bf X$ are linearly independent (check: "…are linearly independent") and orthonormal (not a requisite in the answer: "I'm no longer saying they are orthogonal") – $\color{red}{\text{Do these vectors actually need to be defined as orthonormal?}}$ Or can this constraint of orthonormality of the vectors $a_i$ be relaxed, or is it implicitly fulfilled by virtue of other considerations? Otherwise we would have a rather specific $\bf A$ matrix, making the results less generalizable.
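Regarding the red question, a small NumPy illustration (my own toy example; `is_projection` is just a hypothetical helper name): each summand $\bf a_i^\top a_i$ is a rank-one orthogonal projector as soon as $a_i$ is normalized, but the full sum $\bf A^\top A$ is itself a projection matrix only when the rows are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(2)

def is_projection(M):
    """Symmetric and idempotent, up to numerical tolerance."""
    return np.allclose(M, M.T) and np.allclose(M @ M, M)

# a single normalized row a_i: a_i^T a_i is a rank-one projector onto span(a_i^T)
a = rng.normal(size=3)
a /= np.linalg.norm(a)
print(is_projection(np.outer(a, a)))     # True

# normalized but non-orthogonal rows: the sum A^T A is generally NOT a projector
A = rng.normal(size=(3, 3))
A /= np.linalg.norm(A, axis=1, keepdims=True)
print(is_projection(A.T @ A))            # False (in general)

# orthonormal rows: the sum IS a projector (here the identity, since the rows span R^3)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
print(is_projection(Q.T @ Q))            # True
```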
Concept Four:
A projection onto what?
Onto the column space of $\bf X$ (think of the OLS projection ${\bf A}\color{gray}{(A^\top A)^{-1}} {\bf A^\top}$). But what is $\bf X$ here? None other than $\bf a_i^\top$, and since $\bf a_i$ is a row vector, $\bf a_i^\top$ is a column vector.
So we are doing ortho-projections onto the column space of $\bf A^\top$, which is in $\mathbb R^{\color{blue}{\bf n}}$.
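To see concretely that each summand maps vectors of $\mathbb R^{\color{blue}{\bf n}}$ into the span of $\bf a_i^\top$ (hence into the column space of $\bf A^\top$), here is a tiny NumPy check, again a toy example of my own:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=4)
a /= np.linalg.norm(a)                   # a_i as a unit row vector in R^4

P = np.outer(a, a)                       # the summand a_i^T a_i
x = rng.normal(size=4)
print(np.allclose(P @ x, (a @ x) * a))   # True: P x is a multiple of a_i^T
```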
I was hoping that this last conclusion could have read "… onto the column space of $\bf A$ …".
What are the implications?
Best Answer
Suppose we are given a matrix $\mathrm A$ that has full column rank. Its SVD is of the form
$$\mathrm A = \mathrm U \Sigma \mathrm V^T = \begin{bmatrix} \mathrm U_1 & \mathrm U_2\end{bmatrix} \begin{bmatrix} \hat\Sigma\\ \mathrm O\end{bmatrix} \mathrm V^T$$
where the zero matrix may be empty. Note that
$$\mathrm A \mathrm A^T = \mathrm U \Sigma \mathrm V^T \mathrm V \Sigma^T \mathrm U^T = \mathrm U \begin{bmatrix} \hat\Sigma^2 & \mathrm O\\ \mathrm O & \mathrm O\end{bmatrix} \mathrm U^T$$
can only be a projection matrix if $\hat\Sigma = \mathrm I$. However,
$$\begin{array}{rl} \mathrm A (\mathrm A^T \mathrm A)^{-1} \mathrm A^T &= \mathrm U \Sigma \mathrm V^T (\mathrm V \Sigma^T \mathrm U^T \mathrm U \Sigma \mathrm V^T)^{-1} \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \mathrm V^T (\mathrm V \Sigma^T \mathrm \Sigma \mathrm V^T)^{-1} \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \mathrm V^T (\mathrm V \hat\Sigma^2 \mathrm V^T)^{-1} \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \mathrm V^T \mathrm V \hat\Sigma^{-2} \mathrm V^T \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \hat\Sigma^{-2} \Sigma^T \mathrm U^T\\ &= \mathrm U \begin{bmatrix} \mathrm I & \mathrm O\\ \mathrm O & \mathrm O\end{bmatrix} \mathrm U^T = \mathrm U_1 \mathrm U_1^T\end{array}$$
is always a projection matrix: it is the orthogonal projector onto the column space of $\mathrm U_1$, which equals the column space of $\mathrm A$.
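A quick numerical confirmation of the two claims above (a toy NumPy check of my own, not part of the answer itself):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 3))                         # full column rank with probability 1

U1, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U1 is 6 x 3
H = A @ np.linalg.inv(A.T @ A) @ A.T                # A (A^T A)^{-1} A^T

print(np.allclose(H, U1 @ U1.T))                    # True: H = U1 U1^T
print(np.allclose(H @ H, H), np.allclose(H, H.T))   # True True: always a projector
AAt = A @ A.T
print(np.allclose(AAt @ AAt, AAt))                  # False: A A^T alone is not a projector
```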