Is there a formal link between linear regression and PCA? The goal of PCA is to decompose a matrix into a linear combination of variables that captures most of the information in the matrix. Suppose, for the sake of argument, that we are doing PCA on an input matrix rather than its covariance matrix, and that the columns $X_1, X_2, \ldots, X_n$ of the matrix are variables of interest. Intuitively, the PCA procedure then seems similar to a linear regression in which a linear combination of the variables is used to predict the entries of the matrix. Is this correct thinking? How can it be made mathematically precise?
Imagine enumerating the (infinite) space of all linear combinations of the variables $X_1, X_2, \ldots, X_n$ of a data matrix and running a linear regression on each such combination to measure how much of the rows of the matrix that combination can 'explain'. Is there an interpretation of what PCA is doing in terms of this operation? I.e., how would PCA select the 'best' linear combinations in this procedure? I realize this procedure is obviously not computationally feasible; I present it only to try to make the link between PCA and linear regression explicit. Since it works directly with linear combinations of the columns of a matrix, it does not require them to be orthogonal.
Best Answer
The difference between PCA and regression can be interpreted as being mathematically only one additional multiplication with an inverse matrix...
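Before the worked example, the geometric contrast between the two procedures can be sketched numerically. This is a minimal illustration with synthetic data of my own (not the answer's example): regression minimizes vertical residuals, while the first principal component minimizes orthogonal residuals, so the two fitted slopes differ on the same data.

```python
import numpy as np

# Synthetic two-variable data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)
Xc = np.column_stack([x, y])
Xc -= Xc.mean(axis=0)

# Regression: minimize *vertical* squared residuals of y given x.
b_ols = (Xc[:, 0] @ Xc[:, 1]) / (Xc[:, 0] @ Xc[:, 0])

# PCA: the first principal component minimizes *orthogonal* squared
# distances to the fitted line (the slope ratio is sign-invariant).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
b_pca = Vt[0, 1] / Vt[0, 0]
```

For positively correlated data the PCA slope always lies between the y-on-x regression slope and the inverted x-on-y slope, so `b_pca` exceeds `b_ols` here.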
Here is a correlation matrix with three groups of variables:
The correlation matrix looks like this:
This is the loadings matrix in its initial triangular Cholesky form, not yet rotated to the PC position:
In PCA we do not distinguish between independent and dependent variables, so we might rotate to the PCA position, where the first axis/column then denotes the first principal component, and so on.
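This step can be sketched numerically. The $3\times 3$ correlation matrix below is a made-up stand-in (the answer's actual values are not reproduced here); the point is that the triangular Cholesky factor is a valid loadings matrix, and the PCA position is just one orthogonal rotation of it, with both reproducing the same correlations.

```python
import numpy as np

# Illustrative correlation matrix (stand-in, not the answer's data).
R = np.array([[1.0, 0.8, 0.5],
              [0.8, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

# Lower-triangular Cholesky factor: each row holds one variable's
# loadings on orthogonal factors, and L @ L.T reproduces R.
L = np.linalg.cholesky(R)

# PCA position: loadings = eigenvectors scaled by sqrt(eigenvalues),
# with columns ordered by explained variance (descending).
evals, evecs = np.linalg.eigh(R)
pca_loadings = evecs[:, ::-1] * np.sqrt(evals[::-1])
```

Any orthogonal rotation `L @ Q` leaves `L @ Q @ (L @ Q).T == R` unchanged; PCA is the particular rotation whose columns are variance-ordered.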
We see that all variables have one common factor, but we might also see that only two or three factors are "relevant". A quartimax rotation might locate the three main factors in better correspondence with the variable groups:
We see that we have two main groups of x-variables with high within-group correlations, meaning they measure nearly the same thing, and a less sharply separated "group" consisting only of the y-variable, which is correlated with both groups but still has an individual variance (in factor f3).
This is classical PCA with quartimax/varimax rotation, the "Little Jiffy" procedure.
Now we move on to regression. In regression we define one variable as dependent, in our case the variable y. We are interested in how y is composed of the independent variables. A still PCA-inherent point of view would be to find the PCA of the independent variables only, and to leave the factor f6, which carries the part of the y-variance that is uncorrelated with the x, alone as taken from the initial triangular Cholesky factor.
The axes still show the "factors" and how each variable is composed of those common (f1 to f5) or individual (f6) factors.
Regression now asks for the composition of y not in terms of the factors/coordinates on the axes, but in terms of the coordinates on the x-variables when those are taken as axes.
Happily, we need only multiply the current loadings matrix by the inverse of the x-submatrix to get axes defined by the x and obtain the "loadings" of y on the x:
We see that each of the axes is now identified with one of the x-variables, together with the "loadings" of y on those axes. Because the axes now point in the directions of the x, the loadings of y in the last row are also the "regression" weights/coefficients, and the regression weights are now
(Because of the strong correlations within the groups, the regression weights exceed 1 and their signs alternate. But that is not of much concern for this methodological explanation.)
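The "one extra multiplication with an inverse matrix" claim can be verified numerically. Below is a minimal sketch with a hypothetical $3\times 3$ correlation matrix of my own (two x-variables and y as the last variable; the values are illustrative, not the answer's): post-multiplying the common-factor columns of the Cholesky loadings by the inverse of the x-submatrix turns the y-row into the regression coefficients.

```python
import numpy as np

# Illustrative correlation matrix; the last variable plays y.
R = np.array([[1.0, 0.8, 0.5],
              [0.8, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

L = np.linalg.cholesky(R)      # triangular loadings, L @ L.T == R
Lx = L[:2, :2]                 # x-submatrix of the loadings

# Post-multiply the x-factor columns by inv(Lx): the x rows become
# the identity (each axis is now one x-variable) and the y-row turns
# into the regression coefficients of y on x1, x2.
rotated = L[:, :2] @ np.linalg.inv(Lx)
b_from_loadings = rotated[2]

# Check against the usual normal-equations solution b = Rxx^{-1} r_xy.
b_normal = np.linalg.solve(R[:2, :2], R[:2, 2])

# y's loading on its unique factor stays untouched; its square is the
# unexplained variance 1 - R^2.
unexplained = L[2, 2] ** 2
```

The algebra behind the check: with $b^\top = \ell_y^\top L_x^{-1}$ and $R_{xx} = L_x L_x^\top$, $r_{xy} = L_x \ell_y$, one indeed gets $b = R_{xx}^{-1} r_{xy}$.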
[Update]
Relating PCA and regression in this way, another instructive example arises very naturally which might improve intuition. This is the problem of multicollinearity: when it occurs in regression it is a problem for the researcher, but when it occurs in PCA it only improves the validity of the estimation of the separate components and of the loadings of the items on such (latent) constructs.
The device I want to introduce here is the "main direction" of the multicollinear items (which is of course the first principal component) for each of the two sets of independent items, $x1$ and $x2$. We can introduce latent variables which mark the PCs of the two sets of x-items. Practically, this can be done by applying a PC rotation whose maximizing criterion is taken from each set only:
If we were looking at the system of $x1_1, x1_2$ and $y$ only, we would already have the beta-values for the PCs of that $x1$ item set as $b_{pc1_1}=0.722$ and $b_{pc1_2}=0.575$ - no hassle because of (multi)collinearity!
The same can be done with the second set of independent items $x2_1,x2_2,x2_3$:
The beta-value for the first PC of the second set of items (in a model without the first set) would be $b_{pc2_1}=0.923$, which is more than the $b_{pc1_1}=0.722$ for the first PC of the first set of independent items.
To see the betas in the joint model, we again need only invert the submatrix of the loadings of the whole set of 5 PC-markers and post-multiply the first 5 columns with that inverse. This gives us the "loadings" when the 5 PCs are taken as axes of a coordinate system. We get
In short:
In the joint model, the "main direction" of the first set of independents has a beta-weight of 0.540 and that of the second set 0.421. The value at $c6$ is here only for completeness: its square, $0.116^2$, is the unexplained variance of the dependent item $y$.
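The multicollinearity point can also be illustrated with simulated data. This is a sketch under my own assumptions (not the answer's numbers): two nearly collinear items leave the individual regression weights ill-determined, while the beta on their first principal component, the "main direction", is a single well-determined coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
f = rng.normal(size=n)                 # latent common factor
x1 = f + 0.1 * rng.normal(size=n)      # two strongly collinear items
x2 = f + 0.1 * rng.normal(size=n)
y = f + rng.normal(size=n)

def z(v):
    return (v - v.mean()) / v.std()

X = np.column_stack([z(x1), z(x2)])

# Raw betas on the collinear items: the two coefficients trade off
# against each other, so each one alone is poorly determined.
b_raw, *_ = np.linalg.lstsq(X, z(y), rcond=None)

# Beta on the first principal component of the item set instead:
_, _, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0] * np.sign(Vt[0, 0])         # fix sign to point with x1
pc1 = z(X @ v1)
b_pc1 = (pc1 @ z(y)) / (pc1 @ pc1)     # one stable "main direction" beta
```

Rerunning with different seeds, `b_raw[0]` and `b_raw[1]` fluctuate strongly (only their sum is pinned down), whereas `b_pc1` stays close to the correlation between the common factor's direction and y.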