You appear to be assuming that the largest eigenvalue can necessarily be paired with the largest coefficient within its eigenvector. That is wrong.
The question clearly transcends software choice. Here is a fairly silly PCA on five measures of car size using Stata's auto dataset. I used a correlation matrix as the starting point, the only sensible option given the quite different units of measurement.
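To reproduce what follows, a minimal sketch: the auto dataset ships with Stata, so

. sysuse auto, clear

loads it, and the pca command below does the rest.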
. pca headroom trunk weight length displacement
Principal components/correlation                 Number of obs    =       74
                                                 Number of comp.  =        5
                                                 Trace            =        5
    Rotation: (unrotated = principal)            Rho              =   1.0000

    --------------------------------------------------------------------------
       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |      3.76201        3.026             0.7524       0.7524
           Comp2 |      .736006      .427915             0.1472       0.8996
           Comp3 |      .308091      .155465             0.0616       0.9612
           Comp4 |      .152627      .111357             0.0305       0.9917
           Comp5 |     .0412693            .             0.0083       1.0000
    --------------------------------------------------------------------------
Principal components (eigenvectors)

    ----------------------------------------------------------------
        Variable |    Comp1     Comp2     Comp3     Comp4     Comp5
    -------------+--------------------------------------------------
        headroom |   0.3587    0.7640    0.5224   -0.1209    0.0130
           trunk |   0.4334    0.3665   -0.7676    0.2914    0.0612
          weight |   0.4842   -0.3329    0.0737   -0.2669    0.7603
          length |   0.4863   -0.2372   -0.1050   -0.5745   -0.6051
    displacement |   0.4610   -0.3390    0.3484    0.7065   -0.2279
    ----------------------------------------------------------------
(Output truncated.)
The first component picks up the fact that, as all the variables are measures of size, they are well correlated. So to a first approximation the coefficients are equal; that is to be expected when all the variables hang together. The remaining components in effect pick up the idiosyncratic contribution of each of the original variables. That is not inevitable, but it works out quite simply in this example. But, to your point, you can see that the largest coefficients, say those above 0.7 in absolute value, are associated with components 2 to 5. There is nothing to stop the largest coefficient being associated with the last component.
(UPDATE) The eigenvectors are informative, but it is also helpful to calculate the components themselves as new variables and then look at their correlations with the original variables (Stata code for this is sketched below). Here they are:
                     pc1      pc2      pc3      pc4      pc5
    headroom      0.6958   0.6554   0.2900  -0.0472   0.0026
    trunk         0.8405   0.3144  -0.4261   0.1138   0.0124
    weight        0.9392  -0.2856   0.0409  -0.1043   0.1545
    length        0.9432  -0.2035  -0.0583  -0.2245  -0.1229
    displacement  0.8942  -0.2909   0.1934   0.2760  -0.0463
Here trunk is the variable most strongly correlated with pc3, but negatively. A story on why that happens would depend on looking at the data and the PCs. I don't care enough about the example to do that here, but it would be good practice.
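For completeness, a sketch of how those scores and correlations were obtained (pc1 to pc5 are just my names for the new variables; predict with the score option is standard after pca):

. predict pc1 pc2 pc3 pc4 pc5, score
. correlate headroom trunk weight length displacement pc1-pc5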
Although I produced the example with a little prior thought, and it is suitable for your question, and it is based on real data, it is also salutary: interpreting the PCs may be no easier than interpreting something more direct, such as scatter plots and correlations. However, not every PCA application depends on the ability to interpret PCs as having substantive meaning, and much of the literature warns against doing that in any case. For some purposes, the whole point is a mechanistic reordering of the information in the data.
I will try to explain how orthogonality of $a_1$ and $a_2$ ensures that $y_1 = a_1^T x$ and $y_2 = a_2^T x$ are uncorrelated. We want $a_1$ to maximize $Var(y_1)=a_1^T \Sigma a_1$. This is unbounded unless we constrain $a_1$, in this case by $a_1^T a_1=1$. This constrained optimization calls for a Lagrange multiplier (it is not too complicated; read about it on Wikipedia). We thus try to maximize
\begin{equation}
a_1^T \Sigma a_1 - \lambda(a_1^T a_1-1)
\end{equation}
with respect to both $a_1$ and $\lambda$. Notice that differentiating with respect to $\lambda$ and equating to $0$ just recovers our constraint $a_1^T a_1=1$. Differentiating with respect to $a_1$ gives $2\Sigma a_1 - 2\lambda a_1 = 0$; cancelling the factor of $2$,
\begin{equation}
\Sigma a_1 -\lambda a_1 =0
\end{equation}
or
\begin{equation}
(\Sigma -\lambda I_p)a_1=0
\end{equation}
so $a_1$ must be an eigenvector of $\Sigma$ with eigenvalue $\lambda$. Premultiplying $\Sigma a_1 = \lambda a_1$ by $a_1^T$ gives $Var(y_1) = a_1^T \Sigma a_1 = \lambda a_1^T a_1 = \lambda$, so the variance of $y_1$ is maximized by choosing $\lambda$ to be the greatest eigenvalue $\lambda_1$; thus $\Sigma a_1 = \lambda_1 a_1$. Here comes the part that will answer your question. Some elementary calculations, using the definition of covariance, the symmetry of $\Sigma$, and the fact that a scalar equals its own transpose, show that
\begin{equation}
Cov(y_1,y_2)=Cov(a^T_1 x,a^T_2 x)=a^T_1\Sigma a_2=a^T_2\Sigma a_1=a^T_2\lambda_1 a_1=\lambda_1 a^T_2 a_1
\end{equation}
which will equal $0$ if and only if $a^T_2 a_1=0$.
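A quick empirical check using the auto example above: the covariance matrix of the scores pc1 to pc5 should be diagonal, with the eigenvalues down the diagonal (so the off-diagonal covariances, and hence the correlations, are zero up to rounding error):

. correlate pc1 pc2 pc3 pc4 pc5, covariance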
Best Answer
Write $\mathbf A$ for the centered data matrix, $\mathbf S = \mathbf A^\top \mathbf A/(n-1)$ for its sample covariance matrix, and $\mathbf s_i$, $\mathbf s_j$ for eigenvectors of $\mathbf S$ with distinct eigenvalues $\lambda_i \neq \lambda_j$. Then the score vectors $\mathbf z_i = \mathbf A \mathbf s_i$ and $\mathbf z_j = \mathbf A \mathbf s_j$ are orthogonal:
$$\mathbf z_i^\top \mathbf z_j = (\mathbf A\mathbf s_i)^\top (\mathbf A\mathbf s_j) = \mathbf s_i^\top \mathbf A^\top \mathbf A \mathbf s_j = (n-1)\, \mathbf s_i^\top \mathbf S \mathbf s_j = (n-1)\, \mathbf s_i^\top \lambda_j \mathbf s_j = (n-1)\, \lambda_j\, \mathbf s_i^\top \mathbf s_j = 0.$$
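The last equality uses the standard fact that eigenvectors of a symmetric matrix belonging to distinct eigenvalues are orthogonal. For completeness, a one-line derivation:
$$\lambda_i\, \mathbf s_i^\top \mathbf s_j = (\mathbf S \mathbf s_i)^\top \mathbf s_j = \mathbf s_i^\top \mathbf S \mathbf s_j = \lambda_j\, \mathbf s_i^\top \mathbf s_j,$$
so $(\lambda_i - \lambda_j)\, \mathbf s_i^\top \mathbf s_j = 0$, and $\mathbf s_i^\top \mathbf s_j = 0$ since $\lambda_i \neq \lambda_j$.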