Is there a formal link between linear regression and PCA? The goal of PCA is to decompose a matrix into a linear combination of variables that captures most of the information in the matrix. Suppose, for the sake of argument, that we are doing PCA on an input matrix rather than its covariance matrix, and that the columns $X_1, X_2, \ldots, X_n$ of the matrix are variables of interest. Intuitively, the PCA procedure then seems similar to a linear regression in which a linear combination of the variables is used to predict the entries of the matrix. Is this correct thinking? How can it be made mathematically precise?
Imagine enumerating the (infinite) space of all linear combinations of the variables $X_1, X_2, \ldots, X_n$ of a data matrix and running a linear regression on each such combination to measure how much of the rows of the matrix that combination can 'explain'. Is there an interpretation of what PCA is doing in terms of this operation? I.e., how would PCA select the 'best' linear combinations in this procedure? I realize this procedure is obviously not computationally feasible; I present it only to try to make the link between PCA and linear regression explicit. Since it works directly with linear combinations of the columns of a matrix, it does not require them to be orthogonal.
Best Answer
The difference between PCA and regression can be interpreted as being mathematically only one additional multiplication with an inverse matrix...
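Before the worked example, the geometric contrast between the two procedures can be sketched numerically. This is a minimal illustration with synthetic data of my own (not the answer's example): regression minimizes vertical residuals, while the first principal component minimizes orthogonal residuals, so the two fitted slopes differ on the same data.

```python
import numpy as np

# Synthetic two-variable data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)
Xc = np.column_stack([x, y])
Xc -= Xc.mean(axis=0)

# Regression: minimize *vertical* squared residuals of y given x.
b_ols = (Xc[:, 0] @ Xc[:, 1]) / (Xc[:, 0] @ Xc[:, 0])

# PCA: the first principal component minimizes *orthogonal* squared
# distances to the fitted line (the slope ratio is sign-invariant).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
b_pca = Vt[0, 1] / Vt[0, 0]
```

For positively correlated data the PCA slope always lies between the y-on-x regression slope and the inverted x-on-y slope, so `b_pca` exceeds `b_ols` here.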
Here is a correlation matrix with three groups of variables:
The correlation matrix looks like this:
This is the loadings matrix in its initial triangular Cholesky form, not yet rotated to the PC position:
In PCA we do not distinguish between independent and dependent variables, so we might rotate to the PCA position, where the first axis/column then denotes the first principal component, and so on.
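This step can be sketched numerically. The $3\times 3$ correlation matrix below is a made-up stand-in (the answer's actual values are not reproduced here); the point is that the triangular Cholesky factor is a valid loadings matrix, and the PCA position is just one orthogonal rotation of it, with both reproducing the same correlations.

```python
import numpy as np

# Illustrative correlation matrix (stand-in, not the answer's data).
R = np.array([[1.0, 0.8, 0.5],
              [0.8, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

# Lower-triangular Cholesky factor: each row holds one variable's
# loadings on orthogonal factors, and L @ L.T reproduces R.
L = np.linalg.cholesky(R)

# PCA position: loadings = eigenvectors scaled by sqrt(eigenvalues),
# with columns ordered by explained variance (descending).
evals, evecs = np.linalg.eigh(R)
pca_loadings = evecs[:, ::-1] * np.sqrt(evals[::-1])
```

Any orthogonal rotation `L @ Q` leaves `L @ Q @ (L @ Q).T == R` unchanged; PCA is the particular rotation whose columns are variance-ordered.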
We see that all variables have one common factor, but we might also see that only two or three factors are "relevant". A quartimax rotation might locate the three main factors in better correspondence with the variable groups:
We see that we have two main groups of x-variables with high within-group correlations, meaning they measure nearly the same thing, and a less sharply separated "group" consisting only of the y-variable, which is correlated with both groups but still has an individual variance (in factor f3).
This is classical PCA with quartimax/varimax rotation, the "Little Jiffy" procedure.
Now we move on to regression. In regression we define one variable as dependent, in our case the variable y. We are interested in how y is composed of the independent variables. A still PCA-inherent point of view would be to find the PCA of the independent variables only, and to leave the factor f6, which carries the part of the y-variance that is uncorrelated with the x, alone as taken from the initial triangular Cholesky factor.
The axes still show the "factors" and how each variable is composed of those common (f1 to f5) or individual (f6) factors.
Regression now asks for the composition of y not in terms of the factors/coordinates on the axes, but in terms of the coordinates on the x-variables when those are taken as axes.
Happily, we need only multiply the current loadings matrix by the inverse of the x-submatrix to get axes defined by the x and obtain the "loadings" of y on the x:
We see that each of the axes is now identified with one of the x-variables, together with the "loadings" of y on those axes. Because the axes now point in the directions of the x, the loadings of y in the last row are also the "regression" weights/coefficients, and the regression weights are now
(Because of the strong correlations within the groups, the regression weights exceed 1 and their signs alternate. But that is not of much concern for this methodological explanation.)
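The "one extra multiplication with an inverse matrix" claim can be verified numerically. Below is a minimal sketch with a hypothetical $3\times 3$ correlation matrix of my own (two x-variables and y as the last variable; the values are illustrative, not the answer's): post-multiplying the common-factor columns of the Cholesky loadings by the inverse of the x-submatrix turns the y-row into the regression coefficients.

```python
import numpy as np

# Illustrative correlation matrix; the last variable plays y.
R = np.array([[1.0, 0.8, 0.5],
              [0.8, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

L = np.linalg.cholesky(R)      # triangular loadings, L @ L.T == R
Lx = L[:2, :2]                 # x-submatrix of the loadings

# Post-multiply the x-factor columns by inv(Lx): the x rows become
# the identity (each axis is now one x-variable) and the y-row turns
# into the regression coefficients of y on x1, x2.
rotated = L[:, :2] @ np.linalg.inv(Lx)
b_from_loadings = rotated[2]

# Check against the usual normal-equations solution b = Rxx^{-1} r_xy.
b_normal = np.linalg.solve(R[:2, :2], R[:2, 2])

# y's loading on its unique factor stays untouched; its square is the
# unexplained variance 1 - R^2.
unexplained = L[2, 2] ** 2
```

The algebra behind the check: with $b^\top = \ell_y^\top L_x^{-1}$ and $R_{xx} = L_x L_x^\top$, $r_{xy} = L_x \ell_y$, one indeed gets $b = R_{xx}^{-1} r_{xy}$.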
[Update]
Relating PCA and regression in this way, another instructive example arises very naturally which might improve intuition. This is the problem of multicollinearity: when it occurs in regression it is a problem for the researcher, but when it occurs in PCA it only improves the validity of the estimation of the separate components and of the loadings of the items on such (latent) constructs.
The device I want to introduce here is the "main direction" of the multicollinear items (which is of course the first principal component) for each of the two sets of independent items, $x1$ and $x2$. We can introduce latent variables which mark the PCs of the two sets of x-items. Practically, this can be done by applying a PC rotation whose maximizing criterion is taken from each set only:
If we were looking at the system of $x1_1, x1_2$ and $y$ only, we would already have the beta-values for the PCs of that $x1$ item set as $b_{pc1_1}=0.722$ and $b_{pc1_2}=0.575$ - no hassle because of (multi)collinearity!
The same can be done with the second set of independent items $x2_1,x2_2,x2_3$:
The beta-value for the first PC of the second set of items (in a model without the first set) would be $b_{pc2_1}=0.923$, which is more than the $b_{pc1_1}=0.722$ for the first PC of the first set of independent items.
To see the betas in the joint model, we again need only invert the submatrix of the loadings of the whole set of 5 PC-markers and post-multiply the first 5 columns with that inverse. This gives us the "loadings" when the 5 PCs are taken as axes of a coordinate system. We get
In short:
In the joint model, the "main direction" of the first set of independents has a beta-weight of 0.540 and that of the second set 0.421. The value at $c6$ is here only for completeness: its square, $0.116^2$, is the unexplained variance of the dependent item $y$.
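The multicollinearity point can also be illustrated with simulated data. This is a sketch under my own assumptions (not the answer's numbers): two nearly collinear items leave the individual regression weights ill-determined, while the beta on their first principal component, the "main direction", is a single well-determined coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
f = rng.normal(size=n)                 # latent common factor
x1 = f + 0.1 * rng.normal(size=n)      # two strongly collinear items
x2 = f + 0.1 * rng.normal(size=n)
y = f + rng.normal(size=n)

def z(v):
    return (v - v.mean()) / v.std()

X = np.column_stack([z(x1), z(x2)])

# Raw betas on the collinear items: the two coefficients trade off
# against each other, so each one alone is poorly determined.
b_raw, *_ = np.linalg.lstsq(X, z(y), rcond=None)

# Beta on the first principal component of the item set instead:
_, _, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0] * np.sign(Vt[0, 0])         # fix sign to point with x1
pc1 = z(X @ v1)
b_pc1 = (pc1 @ z(y)) / (pc1 @ pc1)     # one stable "main direction" beta
```

Rerunning with different seeds, `b_raw[0]` and `b_raw[1]` fluctuate strongly (only their sum is pinned down), whereas `b_pc1` stays close to the correlation between the common factor's direction and y.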