Solved – In PCA, do we have to center and normalize the eigenvectors, or only normalize them

pca, self-study

Given the following matrix of the grades of 6 students in Maths, Computer Science and French:

$$A=\begin{bmatrix}
1 & 0 & 0 \\
0 & 0 & 1 \\
0 & 1 & 2 \\
2 & 2 & 1 \\
1 & 0 & 0 \\
2 & 3 & 2
\end{bmatrix}$$

Work out the principal component analysis.

I found that the eigenvectors were:
$$
v_1=\begin{bmatrix}
1\\
-1\\
1
\end{bmatrix},
v_2=\begin{bmatrix}
1\\
0\\
-1
\end{bmatrix},
v_3=\begin{bmatrix}
1\\
2\\
1
\end{bmatrix}$$

associated with the spectrum of the covariance matrix, $\operatorname{Sp}=\{0,\frac{2}{3},2\}$.

Thus, the two nontrivial factor axes are $v_2$ and $v_3$, which explain $25\%$ and $75\%$ of the inertia, respectively.
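
As a sanity check, here is a minimal numpy sketch (my own addition, using the grades matrix above) that recovers these eigenvalues and inertia fractions from the covariance matrix of the centered data:

```python
import numpy as np

# Grades matrix from the question: 6 students x 3 subjects.
A = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 2],
              [2, 2, 1],
              [1, 0, 0],
              [2, 3, 2]], dtype=float)

g = A.mean(axis=0)        # column means -> (1, 1, 1)
Y = A - g                 # centered data
V = Y.T @ Y / len(A)      # covariance matrix (dividing by n)

eigvals, _ = np.linalg.eigh(V)
print(eigvals)                  # ~ [0, 2/3, 2]
print(eigvals / eigvals.sum())  # ~ [0, 0.25, 0.75] of the inertia
```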

I'm now trying to work out the principal components themselves.

After calculating the matrix of centered data $Y=A-g$, with $g=(1,1,1)$ the vector of column means subtracted from each row,

do I have to center and normalize the eigenvectors ("explaining vectors"), or only normalize them? Why?

That is to say, do I get

$$\begin{bmatrix}
\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}}\\
0 & \frac{2}{\sqrt{6}}\\
\frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{6}}
\end{bmatrix}$$

or something else and why?
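
For reference, a small numpy sketch of what the projection looks like if the centering is applied to the data and the eigenvectors are only normalized (the matrix above): the variance of the coordinates along each axis then recovers the corresponding eigenvalue.

```python
import numpy as np

A = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 2],
              [2, 2, 1], [1, 0, 0], [2, 3, 2]], dtype=float)
Y = A - A.mean(axis=0)    # the centering is applied to the DATA rows

# Unit-norm eigenvectors of the two nontrivial axes (no centering applied):
U = np.column_stack([
    np.array([1.0, 0.0, -1.0]) / np.sqrt(2),  # v2, eigenvalue 2/3
    np.array([1.0, 2.0, 1.0]) / np.sqrt(6),   # v3, eigenvalue 2
])

F = Y @ U                 # 6 x 2 matrix of principal coordinates
print(F.var(axis=0))      # ~ [2/3, 2]: variance along each axis = eigenvalue
```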

Best Answer

Principal component analysis (PCA) in statistics commonly uses either the covariance matrix $\mathbf{C}$ or the correlation matrix $\mathbf{R}$ of the variables (attributes). For your problem, you would therefore generate the $3 \times 3$ correlation matrix and then perform an eigendecomposition of $\mathbf{R}$; this is called solving the symmetric eigenvalue problem. The eigenvalues of $\mathbf{R}$ essentially reveal how strongly the variables are correlated with one another: if the 3 variables were orthogonal (zero correlation), the eigenvalues would each be close to one. After the eigendecomposition of $\mathbf{R}$, you can obtain the determinant as $|\mathbf{R}| = \prod_j \lambda_j$, and the closer $|\mathbf{R}|$ is to unity, the closer $\mathbf{R}$ is to the identity matrix $\mathbf{I}$ (zeroes in the off-diagonals of $\mathbf{R}$ and ones on the diagonal).
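
A minimal numpy sketch of this step, assuming the grades matrix from the question (for these particular data $|\mathbf{R}| = 0$, because the centered third variable is exactly the second minus the first):

```python
import numpy as np

A = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 2],
              [2, 2, 1], [1, 0, 0], [2, 3, 2]], dtype=float)

R = np.corrcoef(A, rowvar=False)      # 3 x 3 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)  # the symmetric eigenvalue problem

print(eigvals)                              # ~ [0, 1, 2] for these grades
print(np.prod(eigvals), np.linalg.det(R))   # |R| = product of the eigenvalues
```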

Without going into detail on the notation for the loadings and the PC scores, the loading matrix $\mathbf{L}$ will be a $3 \times 3$ matrix that reveals the correlation of each of the three variables with the 3 (orthogonal) principal components. I have always used 0.55 as an indicator of high loading. Sometimes variables also load appreciably (rather than mostly) on more than one PC; to decrease this tendency you can apply a varimax orthogonal rotation, which makes each variable load mostly on one component. (Computationally, there is also no meaning to the positive or negative sign of a loading, because the sign can change with the algorithm used. Therefore a positive vs. negative loading should never be interpreted as one sign being correct; the signs can differ between packages.)
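
As an illustration (a sketch of one common convention, not necessarily the exact one intended here), the loadings can be computed by scaling each eigenvector by the square root of its eigenvalue, so that each entry is the correlation between a variable and a component; varimax rotation is omitted:

```python
import numpy as np

A = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 2],
              [2, 2, 1], [1, 0, 0], [2, 3, 2]], dtype=float)
R = np.corrcoef(A, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)

# Loadings: eigenvectors scaled by sqrt(eigenvalue); entry (i, j) is the
# correlation of variable i with principal component j.  The clip guards
# against a numerically tiny negative eigenvalue.
L = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
print(np.round(L, 3))
print(np.abs(L) >= 0.55)   # flag "high" loadings with the 0.55 rule of thumb
```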

Lastly, you will generate a $6 \times 3$ matrix $\mathbf{F}$ of PC scores, which represents your original data points in PC space. The 3 columns of PCs have zero correlation between them and are, by construction, standardized to mean zero and unit variance (they are not necessarily normally distributed; see the note on skewness below).
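
A sketch of how the scores might be computed with numpy, assuming the standardized-variables convention above; note that for these particular grades one eigenvalue is zero, so only two score columns are well defined:

```python
import numpy as np

A = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 2],
              [2, 2, 1], [1, 0, 0], [2, 3, 2]], dtype=float)

Z = (A - A.mean(axis=0)) / A.std(axis=0)  # standardized variables
R = np.corrcoef(A, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)

keep = eigvals > 1e-10                    # drop the degenerate zero-variance axis
F = (Z @ eigvecs[:, keep]) / np.sqrt(eigvals[keep])  # standardized PC scores

print(F.mean(axis=0))                     # ~ 0 in each score column
print(np.corrcoef(F, rowvar=False))       # ~ identity: uncorrelated, unit variance
```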

Regarding normalization, the computation of $\mathbf{R}$ already includes mean-zero standardization of your 3 variables, so any skewness present is retained. Normalization only rescales each variable to the range $[0,1]$, whereas mean-zero standardization gives each variable a mean of zero and unit variance, but not necessarily a normal shape, especially if you have skewness. If you want purely standard normal transforms, then calculate the van der Waerden scores of your data points within each variable.
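
For completeness, a small scipy sketch of van der Waerden scores, i.e. the normal quantiles $\Phi^{-1}\bigl(r/(n+1)\bigr)$ of the within-column ranks $r$:

```python
import numpy as np
from scipy.stats import norm, rankdata

A = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 2],
              [2, 2, 1], [1, 0, 0], [2, 3, 2]], dtype=float)
n = len(A)

ranks = rankdata(A, axis=0)    # ranks within each column (ties get averaged)
W = norm.ppf(ranks / (n + 1))  # van der Waerden scores: Phi^{-1}(rank / (n+1))
print(np.round(W, 3))
```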
