I'm not a statistician, but I understand that the goal of Principal Component Analysis (PCA) is to detect correlation between variables. It is not like linear regression: rather than attempting to predict values, PCA attempts to learn about the relationships between variables. What kind of relationships can PCA detect, linear or nonlinear?
Principal Component Analysis – linear or nonlinear relationships between variables
statistics
Related Solutions
(I assume for the purposes of this answer that the data has been preprocessed to have zero mean.)
Simply put, the PCA viewpoint requires that one compute the eigenvalues and eigenvectors of the covariance matrix, which is the product $\frac{1}{n-1}\mathbf X\mathbf X^\top$, where $\mathbf X$ is the data matrix (variables in rows, $n$ observations in columns). Since the covariance matrix is symmetric, it is diagonalizable, and the eigenvectors can be normalized such that they are orthonormal:
$\frac{1}{n-1}\mathbf X\mathbf X^\top=\frac{1}{n-1}\mathbf W\mathbf D\mathbf W^\top$
On the other hand, applying SVD to the data matrix $\mathbf X$ as follows:
$\mathbf X=\mathbf U\mathbf \Sigma\mathbf V^\top$
and attempting to construct the covariance matrix from this decomposition gives $$ \frac{1}{n-1}\mathbf X\mathbf X^\top =\frac{1}{n-1}(\mathbf U\mathbf \Sigma\mathbf V^\top)(\mathbf U\mathbf \Sigma\mathbf V^\top)^\top = \frac{1}{n-1}(\mathbf U\mathbf \Sigma\mathbf V^\top)(\mathbf V\mathbf \Sigma\mathbf U^\top) $$
and since $\mathbf V$ is an orthogonal matrix ($\mathbf V^\top \mathbf V=\mathbf I$),
$\frac{1}{n-1}\mathbf X\mathbf X^\top=\frac{1}{n-1}\mathbf U\mathbf \Sigma^2 \mathbf U^\top$
and the correspondence is easily seen: the singular values of $\mathbf X$ are the square roots of the eigenvalues of $\mathbf X\mathbf X^\top$, the left singular vectors $\mathbf U$ play the role of the eigenvectors $\mathbf W$, and so on.
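This correspondence is easy to check numerically; a minimal sketch with a random zero-mean data matrix (the $\frac{1}{n-1}$ factor is dropped, since it only rescales both sides equally):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))               # data matrix: 5 variables, 8 observations
X = X - X.mean(axis=1, keepdims=True)     # center each variable (zero mean)

# Covariance route: eigenvalues of X X^T, sorted into descending order
eigvals = np.linalg.eigvalsh(X @ X.T)[::-1]

# SVD route: singular values of X, already in descending order
svals = np.linalg.svd(X, compute_uv=False)

# The square roots of the eigenvalues are the singular values
print(np.allclose(np.sqrt(eigvals), svals))   # True
```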
In fact, using the SVD to perform PCA makes much better sense numerically than forming the covariance matrix to begin with, since the formation of $\mathbf X\mathbf X^\top$ can cause loss of precision. This is detailed in books on numerical linear algebra, but I'll leave you with an example of a matrix that can be SVD'd stably, while forming $\mathbf X\mathbf X^\top$ can be disastrous: the Läuchli matrix,
$\begin{pmatrix}1&1&1\\ \epsilon&0&0\\0&\epsilon&0\\0&0&\epsilon\end{pmatrix}^\top,$
where $\epsilon$ is a tiny number.
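The effect is easy to reproduce; a minimal NumPy sketch with $\epsilon = 10^{-9}$, small enough that $1+\epsilon^2$ rounds to $1$ in double precision:

```python
import numpy as np

eps = 1e-9
# Läuchli matrix in the transposed (3 x 4) form from the text
X = np.array([[1.0, eps, 0.0, 0.0],
              [1.0, 0.0, eps, 0.0],
              [1.0, 0.0, 0.0, eps]])

# Forming X X^T destroys the information in eps: each 1 + eps**2 rounds
# to 1, so the Gram matrix is exactly the all-ones matrix
gram = X @ X.T
print(np.array_equal(gram, np.ones((3, 3))))   # True: the eps's are gone

# The SVD of X itself recovers the two small singular values (= eps) fine
svals = np.linalg.svd(X, compute_uv=False)
print(svals)   # approximately [sqrt(3), 1e-9, 1e-9]
```

Any method that starts from `gram` can only report the two small singular values as (numerical noise around) zero, while the direct SVD keeps them to nearly full relative accuracy.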
Mathematically, the difference between PCA and regression can be reduced to a single additional multiplication by an inverse matrix...
Here is a correlation matrix with three groups of variables:
group g1: $x_{1,1},x_{1,2}$ correlated
group g2: $x_{2,1},x_{2,2},x_{2,3}$ correlated
group g3: $y$ composed from the variables of groups g1 and g2
The correlation matrix looks like
cor x1_1 x1_2 x2_1 x2_2 x2_3 ___y
-----------------------------------------------------------
x1_1 1.000 0.976 0.472 0.444 0.430 0.642
x1_2 0.976 1.000 0.594 0.567 0.538 0.767
x2_1 0.472 0.594 1.000 0.998 0.979 0.919
x2_2 0.444 0.567 0.998 1.000 0.986 0.917
x2_3 0.430 0.538 0.979 0.986 1.000 0.904
___y 0.642 0.767 0.919 0.917 0.904 1.000
-----------------------------------------------------------
This is the loadings-matrix in its initial triangular Cholesky form, not yet rotated to PC-position:
tri f1 f2 f3 f4 f5 f6
-----------------------------------------------------------
x1_1 1.000 0.000 0.000 0.000 0.000 0.000
x1_2 0.976 0.218 0.000 0.000 0.000 0.000
x2_1 0.472 0.612 0.634 0.000 0.000 0.000
x2_2 0.444 0.615 0.649 0.054 0.000 0.000
x2_3 0.430 0.543 0.700 0.119 0.128 0.000
___y 0.642 0.644 0.350 0.156 0.112 0.117
-----------------------------------------------------------
In PCA we do not distinguish between independent and dependent variables, so we might do a rotation to PCA-position, where the first axis/column then denotes the first principal component, and so on.
[22] pca = rot(dr,"pca")
pca f1 f2 f3 f4 f5 f6
-----------------------------------------------------------
x1_1 0.714 -0.692 0.100 0.041 -0.020 0.005
x1_2 0.810 -0.584 -0.031 -0.047 0.025 -0.007
x2_1 0.948 0.300 0.063 -0.081 0.003 0.018
x2_2 0.940 0.333 0.050 -0.049 -0.015 -0.019
x2_3 0.926 0.351 0.072 0.114 0.016 -0.001
___y 0.973 0.045 -0.226 0.027 -0.010 0.004
-----------------------------------------------------------
We see that all variables have one common factor, but we might also see that only two or three factors are "relevant". A quartimax-rotation might align the three main factors better with the variable-groups:
[26] qua = rot(pca,"quartimax",1..6,1..3)
qua.max f1 f2 f3 f4 f5 f6
-----------------------------------+------------------------
x1_1 0.373 0.925 -0.060 0.041 -0.020 0.005
x1_2 0.505 0.859 0.068 -0.047 0.025 -0.007
--------
x2_1 0.988 0.112 -0.059 -0.081 0.003 0.018
x2_2 0.994 0.078 -0.048 -0.049 -0.015 -0.019
x2_3 0.989 0.056 -0.071 0.114 0.016 -0.001
--------
___y 0.908 0.342 0.240 0.027 -0.010 0.004
-----------------------------------+------------------------
We see that we have two main groups with high in-group correlations among the x-variables, saying they measure nearly the same thing, and a less sharply separated "group" with only the y-variable, which is correlated with both groups but still has an individual variance (in factor f3).
This is classical PCA with Quartimax/Varimax-rotation, the "little jiffy"-procedure.
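The rotation from the triangular Cholesky loadings to PCA-position used above can be sketched in NumPy; the correlation matrix here is a hypothetical 3-variable example, not the text's data:

```python
import numpy as np

# Hypothetical 3-variable correlation matrix for illustration
R = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

# Triangular (Cholesky) loadings: tri @ tri.T == R
tri = np.linalg.cholesky(R)

# "Rotation to PCA-position": loadings are eigenvectors scaled by the
# square roots of the eigenvalues, columns ordered by variance explained
evals, evecs = np.linalg.eigh(R)
order = np.argsort(evals)[::-1]
pca_loadings = evecs[:, order] * np.sqrt(evals[order])

# Both factorizations reproduce R ...
assert np.allclose(tri @ tri.T, R)
assert np.allclose(pca_loadings @ pca_loadings.T, R)

# ... so they differ only by an orthogonal rotation Q: tri @ Q = pca_loadings
Q = np.linalg.solve(tri, pca_loadings)
print(np.allclose(Q.T @ Q, np.eye(3)))   # True
```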
Now we move on to regression. In regression we define one variable as dependent, in our case the variable y. We are interested in how y is composed from the independent variables; a still PCA-inherent point of view would be to find the PCA of the independent variables only and leave factor f6, which carries the part of the y-variance that is uncorrelated with the x-variables, alone as taken from the initial triangular Cholesky factor.
[29] pca = rot(dr,"pca",1..5)
pca.reg f1 f2 f3 f4 f5 f6
---------------------------------------------------+--------
x1_1 0.729 -0.680 0.067 -0.048 -0.002 0.000
x1_2 0.816 -0.571 -0.067 0.055 0.001 0.000
-------------------
x2_1 0.946 0.315 -0.066 -0.033 0.019 0.000
x2_2 0.936 0.348 -0.041 -0.013 -0.023 0.000
x2_3 0.923 0.366 0.117 0.036 0.004 0.000
---------------------------------------------------+--------
___y 0.957 0.057 -0.076 0.245 -0.039 0.117
---------------------------------------------------+--------
Still the axes show the "factors" and how each variable is composed from those common (f1 to f5) or individual (f6) factors.
Regression now asks for the composition of y not by the factors/coordinates on the axes, but by the coordinates on the variables x taken as axes.
Happily, we need only multiply the current loadings-matrix by the inverse of the x-submatrix to obtain axes defined by the x-variables, and with them the "loadings" of y on x:
[32] B = (pca[*,1..5]*inv(pca[1..5,1..5]) ) || pca[*,6]
reg.B x1_1 x1_2 x2_1 x2_2 x2_3 ___y
---------------------------------------------------+--------
x1_1 1.000 0.000 0.000 0.000 0.000 0.000
x1_2 0.000 1.000 0.000 0.000 0.000 0.000
-------------------
x2_1 0.000 0.000 1.000 0.000 0.000 0.000
x2_2 0.000 0.000 0.000 1.000 0.000 0.000
x2_3 0.000 0.000 0.000 0.000 1.000 0.000
---------------------------------------------------+--------
___y -1.442 1.989 -1.394 0.955 0.876 0.117
We see that each of the axes is now identified with one of the variables, along with the "loadings" of y on those axes. But because the axes now point in the directions of the x-variables, the loadings of y in the last row are also the "regression"-weights/coefficients, and the regression weights are
B = [-1.442 1.989 -1.394 0.955 0.876 ]
(Because of the strong correlations within the groups, the regression-weights exceed 1 in magnitude and the signs alternate. But this is not of much concern in this methodological explanation.)
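The recipe above can be checked with standard tools; a minimal NumPy sketch with synthetic data mimicking the text's setup (two correlated x-groups plus a dependent y; the data and values are illustrative, not the text's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two internally correlated x-groups and a y composed from both
g1 = rng.normal(size=(n, 1)) + 0.1 * rng.normal(size=(n, 2))   # 2 items
g2 = rng.normal(size=(n, 1)) + 0.1 * rng.normal(size=(n, 3))   # 3 items
y = g1[:, :1] + g2[:, :1] + 0.3 * rng.normal(size=(n, 1))
data = np.hstack([g1, g2, y])
data = (data - data.mean(axis=0)) / data.std(axis=0)           # standardize

R = np.corrcoef(data, rowvar=False)    # 6 x 6 correlation matrix

# Loadings in "initial triangular Cholesky form": tri @ tri.T == R
tri = np.linalg.cholesky(R)

# Regression step from the text: postmultiply the x-part of the loadings
# by the inverse of the x-submatrix; the y-row then holds the betas
B = tri[:, :5] @ np.linalg.inv(tri[:5, :5])
betas_from_loadings = B[5]             # row for y

# Conventional standardized regression weights: R_xx^{-1} r_xy
betas_classic = np.linalg.solve(R[:5, :5], R[:5, 5])

print(np.allclose(betas_from_loadings, betas_classic))   # True
```

The identity holds for any full-rank loadings matrix with `L @ L.T == R`, because any two such factors differ only by an orthogonal rotation, which the inversion of the x-submatrix cancels.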
[Update]
Relating PCA and regression in this way, another instructive example arises quite naturally which might improve intuition: the problem of multicollinearity. In regression, multicollinearity is a problem for the researcher; in PCA it only improves the validity of the estimation of separate components and of the loadings of the items on such (latent) constructs.
The device I want to introduce here is the "main direction" of the multicollinear items (which is of course the first principal component) for each of the two sets of independent items $x1$ and $x2$. We can introduce latent variables which mark the PCs of the two sets of x-items. Practically, this can be done by applying a PC-rotation whose maximizing criterion is taken from the respective set only:
[32] pc1 = rot(tri,"pca",1..2,1..5) //rotating for pc of items 1..2 only
[33] pc1 = {pc1,marksp(pc1,{1,2})} //add markers for that pc's
PC1 pc1_1 pc1_2 c3 c4 c5 c6
--------------------------------------------------------------
x1_1 0.994 -0.109 0 0 0 0
x1_2 0.994 0.109 0 0 0 0
--------------------------------------------------------------
x2_1 0.562 0.571 0.599 0 0 0
x2_2 0.536 0.578 0.613 0.054 0 0
x2_3 0.521 0.516 0.657 0.123 0.125 0
--------------------------------------------------------------
y 0.722 0.575 0.316 0.151 0.11 0.116
===============================================================
pc1_1 1 0 0 0 0 0
pc1_2 0 1 0 0 0 0
===============================================================
If we were looking at the system of $x1_1,x1_2$ and $y$ only, we would already have the beta-values for the PCs of that $x1$-item set as $b_{pc1_1}=0.722$ and $b_{pc1_2}=0.575$: no hassle because of (multi)collinearity!
The same can be done with the second set of independent items $x2_1,x2_2,x2_3$:
[37] pc2 = rot(pc1,"pca",3..5,1..5)
[38] pc2 = {pc2,marksp(pc2,{1,2,3})}
PC2 pc2_1 pc2_2 pc2_3 c4 c5 c6
--------------------------------------------------------------
x1_1 0.477 -0.124 -0.464 0.729 0.101 0
x1_2 0.599 -0.189 -0.389 0.636 0.221 0
--------------------------------------------------------------
x2_1 0.997 -0.078 -0.022 0 0 0
x2_2 0.999 -0.041 0.027 0 0 0
x2_3 0.993 0.119 -0.005 0 0 0
--------------------------------------------------------------
y 0.923 -0.029 -0.058 0.212 0.294 0.116
===============================================================
pc1_1 0.542 -0.157 -0.429 0.687 0.162 0
pc1_2 0.557 -0.300 0.343 -0.424 0.55 0
--------------------------------------------------------------
pc2_1 1 0 0 0 0 0
pc2_2 0 1 0 0 0 0
pc2_3 0 0 1 0 0 0
===============================================================
The beta-value for the first PC of the second set of items (in a model without the first set) would be $b_{pc2_1}=0.923$, which is larger than the $b_{pc1_1}=0.722$ for the first PC of the first set of independent items.
Seeing the betas in the joint model again requires only inverting the submatrix of the loadings of the whole set of 5 PC-markers and postmultiplying the first 5 columns by that inverse. This gives us the "loadings" when the 5 PCs are taken as axes of a coordinate system. We get
[42] beta = pc2[*,1..5] * inv(pc2[7..11,1..5]) || pc2[*,6]
beta pc1_1 pc1_2 pc2_1 pc2_2 pc2_3 c6
--------------------------------------------------------------
x1_1 0.994 -0.109 0 0 0 0
x1_2 0.994 0.109 0 0 0 0
--------------------------------------------------------------
x2_1 0 0 0.997 -0.078 -0.022 0
x2_2 0 0 0.999 -0.041 0.027 0
x2_3 0 0 0.993 0.119 -0.005 0
--------------------------------------------------------------
y 0.540 0.375 0.421 0.169 0.045 0.116
===============================================================
pc1_1 1 0 0 0 0 0
pc1_2 0 1 0 0 0 0
--------------------------------------------------------------
pc2_1 0 0 1 0 0 0
pc2_2 0 0 0 1 0 0
pc2_3 0 0 0 0 1 0
===============================================================
In short:
beta pc1_1 pc1_2 | pc2_1 pc2_2 pc2_3 | c6
---------------------------+-------------------------------+--------
y 0.540 0.375 | 0.421 0.169 0.045 | 0.116
===========================+===============================|========
In this joint model, the "main direction" of the first set of independents has a beta-weight of 0.540 and that of the second set 0.421. The value at $c6$ is here only for completeness: its square, $0.116^2$, is the unexplained variance of the dependent item $y$.
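The "regress on a set's main direction" idea can be sketched with standard tools; the data below are synthetic and illustrative, not the text's example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two nearly collinear predictors, in the spirit of x1_1 and x1_2
base = rng.normal(size=n)
x1 = base + 0.05 * rng.normal(size=n)
x2 = base + 0.05 * rng.normal(size=n)
y = base + 0.5 * rng.normal(size=n)

X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the items
y = (y - y.mean()) / y.std()

# First principal component ("main direction") of the item set
_, _, vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ vt[0]
pc1 /= pc1.std()

# Standardized beta of y on the single standardized pc is just their
# correlation: no unstable matrix inversion despite near-collinearity
b_pc1 = pc1 @ y / n
print(abs(b_pc1))   # large and stable, even though corr(x1, x2) is ~1
```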
Best Answer
PCA is inherently linear, being based on linear algebra. Of course the principles behind it may be extended to algorithms for nonlinear data, but those differ markedly from what we're used to with PCA.
PCA is used for dimensionality reduction. This could be in the concrete sense of finding 2D planes in 3D data, or the more abstract sense of, given a problem over 20 parameters, trying to trim it down to say 2 or 3 combinations thereof in order to be computationally manageable.
E.g. given a D dimensional problem space, can we apply an orthogonal transformation so that the solution only depends on d