There is a matrix equation for solving a linear regression, $\vec{y}=X\vec{\beta}$, where $X$ is the matrix of features, $\vec{\beta}=[\beta_1,\dots,\beta_n]$ is the vector of coefficients for each feature, and $\vec{y}$ is the vector of measurements, i.e. the true values you are trying to fit.
The matrix equation I have seen to solve for $\beta$ is $$\vec{\beta}=(X^TX)^{-1}X^T\vec{y}$$
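This formula can be checked numerically. A minimal NumPy sketch (the data and true coefficients below are made up for illustration), comparing the normal-equation formula against NumPy's least-squares solver:

```python
import numpy as np

# Hypothetical small regression problem: 5 data points, 2 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
beta_true = np.array([1.5, -2.0])
y = X @ beta_true  # noiseless measurements, so beta is recovered exactly

# Normal-equation solution: beta = (X^T X)^{-1} X^T y
beta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference least-squares solver
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_ne, beta_true))  # True
print(np.allclose(beta_ne, beta_ls))    # True
```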
I am wondering: why isn't the solution just $\vec{\beta}=X^{-1}\vec{y}$?
Is it because $X$ may be singular, and this is some kind of pseudo-inverse?
Best Answer
Yes, it is exactly the pseudo-inverse you suggested.
Typically $X\in\mathbb{R}^{m\times n}$ is a tall matrix ($m\ge n$) with $\operatorname{rank}(X)=n$, because the number of data points $m$ should be at least the number of parameters $n$. Also, in practice most observed data matrices attain full column rank. In particular, when $m>n$ the matrix $X$ is not square, so $X^{-1}$ does not even exist.
In this situation, $X^\top X\in\mathbb{R}^{n\times n}$ is invertible, and the pseudo-inverse $X^\dagger$ is exactly $(X^\top X)^{-1}X^\top$.
As a quick check, $$((X^\top X)^{-1}X^\top ) X=I,$$ so $(X^\top X)^{-1}X^\top$ is a left inverse of $X$, and $$X(X^\top X)^{-1}X^\top=\text{the orthogonal projection onto the column space of }X.$$
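Both of these checks, and the equality with the SVD-based pseudo-inverse, can be verified numerically; a sketch using a random tall matrix (sizes chosen arbitrarily):

```python
import numpy as np

# Tall full-column-rank matrix: m = 6 data points, n = 3 parameters.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))

pinv_formula = np.linalg.inv(X.T @ X) @ X.T  # (X^T X)^{-1} X^T
pinv_numpy = np.linalg.pinv(X)               # SVD-based pseudo-inverse

print(np.allclose(pinv_formula, pinv_numpy))     # True
print(np.allclose(pinv_formula @ X, np.eye(3)))  # left inverse: X^dagger X = I

# X (X^T X)^{-1} X^T projects onto col(X): it is idempotent
# and leaves the columns of X fixed.
P = X @ pinv_formula
print(np.allclose(P @ P, P))  # True
print(np.allclose(P @ X, X))  # True
```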
To prove that for $X\in\mathbb{R}^{m\times n}$ with $\operatorname{rank}(X)=n$ we have $X^\dagger=(X^\top X)^{-1}X^\top$, let the SVD of $X$ be $X=U\Sigma V^\top$.
By definition of the pseudo-inverse, $X^\dagger=V\Sigma^{-1}U^\top$, where $\Sigma^{-1}$ denotes the pseudo-inverse of the rectangular matrix $\Sigma$:
$$(\Sigma^{-1})_{ij}=\begin{cases} 0 & \text{ if }\Sigma_{ji}=0 \\ 1/\Sigma_{ji} & \text{ otherwise}\end{cases}$$
That is, the $(i,j)$ entry of $\Sigma^{-1}$ is zero if the corresponding $(j,i)$ entry of $\Sigma$ is zero; otherwise, it is its reciprocal.
Also, since $X^\top X = V\Sigma^\top U^\top U \Sigma V^\top = V\Sigma^\top\Sigma V^\top$, and $\Sigma^\top\Sigma=\operatorname{diag}(\sigma_1^2,\dots,\sigma_n^2)$ is invertible because $\operatorname{rank}(X)=n$,
$$(X^\top X)^{-1}X^\top = (V \Sigma^\top \Sigma V^\top)^{-1} V\Sigma^\top U^\top= V (\Sigma^\top \Sigma)^{-1}V^\top V\Sigma^\top U^\top= V (\Sigma^\top \Sigma)^{-1}\Sigma^\top U^\top$$
Finally, by the definition of $\Sigma^{-1}$ above, $(\Sigma^\top \Sigma)^{-1}\Sigma^\top=\Sigma^{-1}$, so $(X^\top X)^{-1}X^\top = V\Sigma^{-1}U^\top = X^\dagger$ and the proof is complete.
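The two identities in the proof can also be confirmed numerically; a minimal sketch building $\Sigma^{-1}$ explicitly from a full SVD (matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 3))  # rank(X) = n = 3 almost surely

U, s, Vt = np.linalg.svd(X, full_matrices=True)  # X = U Sigma V^T
Sigma = np.zeros((6, 3))
Sigma[:3, :3] = np.diag(s)

# Sigma^{-1}: the transposed shape, with reciprocal singular values
Sigma_inv = np.zeros((3, 6))
Sigma_inv[:3, :3] = np.diag(1.0 / s)

# (Sigma^T Sigma)^{-1} Sigma^T equals Sigma^{-1}
lhs = np.linalg.inv(Sigma.T @ Sigma) @ Sigma.T
print(np.allclose(lhs, Sigma_inv))  # True

# and V Sigma^{-1} U^T equals (X^T X)^{-1} X^T
X_dagger = Vt.T @ Sigma_inv @ U.T
print(np.allclose(X_dagger, np.linalg.inv(X.T @ X) @ X.T))  # True
```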