What does the inverse of the transpose of a matrix times itself mean in linear regression?

inverse, linear regression, matrices, transpose

What does the expression below mean when computing linear regression coefficients, residual standard error, etc.?

$(AA^{\rm T})^{-1}$

For example, the coefficients are calculated as:

$(AA^{\rm T})^{-1}A^{\rm T}y$

where
A: feature matrix
y: response variable

Best Answer

I suspect that you really meant to write $(A^TA)^{-1}$ instead of $(AA^T)^{-1}$. The former is the usual expression that appears in the least-squares approximation for a solution to $Ax=y$. The expression $A^TA$ is called the Gram matrix of $A$ and turns up in many contexts. Its entries are the pairwise dot products of the columns of $A$. It is invertible iff those columns are linearly independent.
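To make this concrete, here is a minimal NumPy sketch with made-up random data (the matrix `A` and vector `y` are purely illustrative, following the question's notation). It checks that the entries of the Gram matrix $A^TA$ are the pairwise dot products of the columns of $A$, and that solving the normal equations $(A^TA)\hat x = A^Ty$ reproduces the least-squares coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))   # hypothetical feature matrix with independent columns
y = rng.normal(size=50)        # hypothetical response vector

# Entries of the Gram matrix A^T A are the pairwise dot products of A's columns.
gram = A.T @ A
pairwise = np.array([[A[:, i] @ A[:, j] for j in range(3)] for i in range(3)])
assert np.allclose(gram, pairwise)

# Least-squares coefficients from the normal equations (A^T A) x = A^T y.
# (Solving the linear system is preferable to forming (A^T A)^{-1} explicitly.)
coef_normal = np.linalg.solve(gram, A.T @ y)
coef_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(coef_normal, coef_lstsq)
```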

In this context, it turns up because we are essentially computing the orthogonal projection of $y$ onto the column space of $A$. (Why? Because $A\hat x=\hat y$ can only have a solution when $\hat y$ is in $A$’s column space.) The complete expression for this projection is $A(A^TA)^{-1}A^Ty$; without the leading $A$ factor, what you end up with is the coordinates of this projection relative to the $A$-basis.

Comparing this to the orthogonal projection of $y$ onto a single vector $a$, namely ${aa^T\over a^Ta}y = a(a^Ta)^{-1}a^Ty$, we can see that the Gram matrix of $A$ plays an analogous role to the normalizing factor $a^Ta$ (which is just the dot product of $a$ with itself). When projecting onto a subspace that has dimension greater than one, though, there’s more to do than simply normalize the columns of $A$. We also need to deal with the “crosstalk” between basis vectors when they’re not orthogonal. (See this answer for more details of what happens when the basis vectors aren’t orthogonal.) The Gram matrix of $A$ encodes both the norms of the basis vectors and how projections onto them overlap. Inverting this matrix sorts all of this out—for me still somewhat magically, to be honest.
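Here is a second short, self-contained sketch of this projection view (again with made-up data): it computes the coordinates $(A^TA)^{-1}A^Ty$, forms the projection $A(A^TA)^{-1}A^Ty$, checks that the residual is orthogonal to the column space of $A$, and compares the general formula in the single-column case with the familiar scalar formula $a(a^Ta)^{-1}a^Ty$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))   # hypothetical feature matrix
y = rng.normal(size=50)        # hypothetical response vector

# Coordinates of the projection relative to the A-basis: (A^T A)^{-1} A^T y
coords = np.linalg.solve(A.T @ A, A.T @ y)
# The projection of y onto the column space of A: A (A^T A)^{-1} A^T y
y_hat = A @ coords

# The residual y - y_hat is orthogonal to every column of A.
assert np.allclose(A.T @ (y - y_hat), 0)

# Projecting onto a single column a: the Gram matrix collapses to the scalar a^T a,
# and the general formula a (a^T a)^{-1} a^T y matches the familiar scalar one.
a = A[:, [0]]                                          # one column, kept as a 50x1 matrix
proj_general = a @ np.linalg.solve(a.T @ a, a.T @ y)   # a (a^T a)^{-1} a^T y
proj_scalar = (A[:, 0] @ y) / (A[:, 0] @ A[:, 0]) * A[:, 0]
assert np.allclose(proj_general, proj_scalar)
```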
