Solved – Why are PCA eigenvectors orthogonal but correlated

Tags: correlation, orthogonal, pca, r

I've seen some great posts explaining PCA and why, under this approach, the eigenvectors of a (symmetric) correlation matrix are orthogonal. I also understand how to show that such vectors are orthogonal to each other (e.g. taking the cross-product of the matrix of these eigenvectors yields a matrix whose off-diagonal entries are zero).

My first question is, when you look at the correlations of a PCA's eigenvectors, why are the off-diagonal entries of the correlation matrix non-zero (i.e. how can the eigenvectors be correlated if they are orthogonal)?

This question is not directly about PCA, but I put it in this context since that is how I ran into the issue. I am using R and specifically the psych package to run PCA.

If it helps to have an example, this post on StackOverflow has one that is very convenient and closely related (also in R). In that post, the author of the best answer shows that the PCA loadings (eigenvectors) are orthogonal by using Factor Congruence or cross-products; in his example, the matrix L is the PCA loadings matrix. The only thing not shown at that link is that cor(L) produces the output I am asking about, with non-zero correlations between the eigenvectors.
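For concreteness, here is a minimal base-R sketch (my own toy data and variable names, not the psych example from that post) that reproduces what I am describing: the eigenvector matrix passes the orthogonality check, yet cor() applied to the same matrix has non-zero off-diagonal entries.

    # Toy illustration: eigenvectors of a correlation matrix are orthonormal,
    # but cor() applied to the eigenvector matrix is generally not the identity,
    # since cor() centers each column by its mean before computing correlations.
    set.seed(1)
    X <- matrix(rnorm(200 * 4), ncol = 4)   # toy data: 200 observations, 4 variables
    R <- cor(X)                              # correlation matrix of the data
    U <- eigen(R)$vectors                    # columns = eigenvectors (the "loadings", up to scaling)

    round(crossprod(U), 10)   # t(U) %*% U is the identity: columns are orthonormal
    round(cor(U), 3)          # off-diagonal entries are generally non-zero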

I am especially confused about how orthogonal vectors can be correlated after reading this post, which seems to prove that orthogonality is equivalent to lack of correlation: Why are PCA eigenvectors orthogonal and what is the relation to the PCA scores being uncorrelated?

My second question is: when the PCA eigenvectors are used to calculate the PCA scores, the scores themselves are uncorrelated (as I expected). Is there a connection to my first question here, i.e. why are the eigenvectors correlated but the scores not?

Best Answer

Let $X$ be a random vector $X=(x_1,x_2,\cdots,x_d)^T$ with expected value $\mu$ and covariance matrix $\Sigma$. We are looking for ordered vectors $u_i$ that maximize the variance of $u_i^TX$; because we are only interested in the direction of such vectors, we additionally require unit length, $u_i^Tu_i=1$. Essentially we are solving $$\max\limits_{u_i} \operatorname{Var}(u_i^TX)$$ $$\text{s.t.} \quad u_i^T u_i=1.$$ The vectors $u_i$ are actually not random (we are working with the population quantities here; in practice the unknown $\Sigma$ and $\mu$ are replaced by the empirical sample covariance matrix and the sample mean, respectively; @whuber explained this from a different perspective), so $$\operatorname{Var}(u_i^TX)=u_i^T\Sigma u_i.$$

The optimization problem can be solved by using the Lagrange function $$L(u_i,\lambda_i):=u_i^T \Sigma u_i -\lambda_i(u_i^Tu_i-1).$$ From there we get the necessary condition for a constrained extremum $$ \frac{\partial L(u_i,\lambda_i)}{\partial u_i} = 2\Sigma u_i -2\lambda_i u_i=0,$$ which reduces to $$\Sigma u_i =\lambda_i u_i,$$ which is by definition the problem of eigenvalues and eigenvectors. Because $\Sigma$ is a symmetric positive semidefinite matrix, the spectral theorem applies and we can find an orthonormal basis of eigenvectors satisfying $\Sigma=Q\Lambda Q^{-1}=Q\Lambda Q^T$, where the columns of $Q$ are the orthonormal eigenvectors and $\Lambda$ is a diagonal matrix of the (real) eigenvalues.
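As a quick numerical check of this decomposition, here is a small R sketch (toy simulated data, with $\Sigma$ replaced by the sample covariance matrix, as noted above):

    # Check the spectral decomposition Sigma = Q Lambda Q^T numerically,
    # with Sigma estimated by the sample covariance matrix of toy data.
    set.seed(2)
    X     <- matrix(rnorm(500 * 3), ncol = 3)
    Sigma <- cov(X)

    e      <- eigen(Sigma)
    Q      <- e$vectors                       # columns are orthonormal eigenvectors
    Lambda <- diag(e$values)                  # real, non-negative eigenvalues

    max(abs(Sigma - Q %*% Lambda %*% t(Q)))   # ~ 0: Sigma = Q Lambda Q^T
    max(abs(crossprod(Q) - diag(3)))          # ~ 0: Q^T Q = I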

Now we can show that $$\operatorname{cov}(u_i^TX,u_j^TX)=u_i^T\Sigma u_j=\lambda_j u_i^Tu_j=0 \quad \forall j \neq i,$$ and trivially, for $i=j$, $\operatorname{cov}(u_i^TX,u_i^TX)=\lambda_i$. So it is not the eigenvectors that are uncorrelated, but the projections (the PCA scores).
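To make that last point concrete, here is a short R sketch (again toy data, with the sample covariance matrix standing in for $\Sigma$) showing that the projections, i.e. the scores, are uncorrelated, even though cor() of the eigenvector matrix itself need not be the identity:

    # Project centered toy data onto the eigenvectors of its covariance matrix:
    # the resulting scores are uncorrelated, with variances equal to the eigenvalues,
    # even though cor() of the eigenvector matrix is generally not the identity.
    set.seed(3)
    X      <- scale(matrix(rnorm(500 * 3), ncol = 3), center = TRUE, scale = FALSE)
    Sigma  <- cov(X)
    e      <- eigen(Sigma)
    scores <- X %*% e$vectors        # row-wise projections u_i^T x

    round(cov(scores), 10)           # diagonal: off-diagonals ~ 0, diagonal = eigenvalues
    round(cor(e$vectors), 3)         # eigenvectors can still show non-zero "correlations"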