Principal Component Analysis Linear Transformation

Tags: linear algebra, principal component analysis, statistics

I am trying to better understand Principal Component Analysis. It is my understanding that our covariance matrix $S\in M_n(\mathbb{R})$ is symmetric and therefore normal. Its normality implies that it is unitarily diagonalizable, that is,

$$S=VDV^{'}$$

where $V$ is a unitary (in fact orthogonal, since $S$ is real) matrix whose columns are eigenvectors of $S$, and $D$ is a diagonal matrix of the corresponding eigenvalues. Eigenvectors point in the directions that a matrix merely stretches or shrinks, but what is the linear transformation imposed by $S$? Do I consider the covariance matrix as the transformation taking the standard basis to the columns of $S$, or as a change of coordinates from the basis formed by the columns of $S$ to the standard basis $\{E_1, E_2, \dots, E_n\}$? How do I interpret the columns of $S$? My intuition tells me that the linear transformation pulls vectors in $\mathbb{R}^n$ towards the axis that contains the most variability. For example, the matrix

$$M=\begin{bmatrix}3&1\\ 1&10\end{bmatrix}$$

should, by this intuition, pull vectors towards the $y$-axis, since its dominant eigenvector lies close to the $y$-axis.
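
As a quick sanity check on this intuition, here is a minimal NumPy sketch (the matrix is just the example $M$ above, and the numbers in the comments are approximate) that diagonalizes $M$ and applies it to a few unit vectors:

import numpy as np

# the example symmetric matrix from above
M = np.array([[3.0, 1.0],
              [1.0, 10.0]])

# eigh handles symmetric matrices; eigenvalues come back in ascending order
# and the eigenvectors are the columns of V, so M = V D V'
eigvals, V = np.linalg.eigh(M)
print(eigvals)   # approximately [2.86, 10.14]
print(V)         # columns are orthonormal eigenvectors

# verify the spectral decomposition M = V D V'
print(np.allclose(M, V @ np.diag(eigvals) @ V.T))   # True

# apply M to a few unit vectors: the images are tilted towards the
# eigenvector of the largest eigenvalue, which here lies close to the y-axis
angles = np.linspace(0, np.pi, 5)
unit_vectors = np.column_stack([np.cos(angles), np.sin(angles)])
print(unit_vectors @ M)   # M is symmetric, so each row is M applied to that unit vector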

Secondly, it is my understanding that the trace of our matrix $S$ is both the sum of the variances of our data and the sum of the eigenvalues. If we choose the eigenvalues that account for the majority of the trace of $S$, then we capture the majority of the variability in our data. If I consider a matrix $S\in M_3(\mathbb{R})$ and the first two eigenvalues explain most of the variability, does this imply that the majority of the spread of my data lies in the first two axes ($x$ and $y$), while on the third axis ($z$) the data is highly concentrated around its mean? That is, my data could be reduced to two dimensions without much loss in its shape. If my intuition above about the linear transformation of $S$ is correct, then the eigenvectors point towards the axes with the most variability, so projecting onto the first two eigenvectors captures most of the variability, since each eigenvalue measures the amount of variance along its eigenvector.
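
To make the trace argument concrete, here is a small sketch with a made-up $3\times 3$ covariance matrix (the entries are purely illustrative) showing that the eigenvalues sum to the trace and how the explained-variance fractions come out:

import numpy as np

# made-up covariance matrix: large variance in the first two coordinates,
# small variance in the third
S = np.array([[9.0, 2.0, 0.3],
              [2.0, 6.0, 0.2],
              [0.3, 0.2, 0.5]])

eigvals = np.linalg.eigvalsh(S)[::-1]            # eigenvalues, sorted descending
print(np.isclose(eigvals.sum(), np.trace(S)))    # True: sum of eigenvalues = trace

explained = eigvals / eigvals.sum()              # fraction of total variance per component
print(explained)                                 # roughly [0.65, 0.32, 0.03]
print(explained[:2].sum())                       # first two components: roughly 0.97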

If I only consider the first two eigenvectors, $V=[V_1,V_2]$, then the span of these columns is a two-dimensional subspace of $\mathbb{R}^3$. With the projection matrix $V(V^{'}V)^{-1}V^{'}$ (which reduces to $VV^{'}$, since the eigenvectors are orthonormal and hence $V^{'}V=I$) I can project my data onto a plane, the column space of $V$. Note that, because of orthogonality, the projection could also be done one eigenvector at a time. If I am wrong please correct me. Therefore we project onto the eigenvectors because they point in the directions that explain most of the variability.
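
Here is a minimal sketch of that projection on some toy 3-D data (the data and the random seed are made up for illustration); it also checks that $V(V^{'}V)^{-1}V^{'}$ collapses to $VV^{'}$ when the columns of $V$ are orthonormal:

import numpy as np

rng = np.random.default_rng(0)

# toy centred 3-D data (rows are observations), made up for illustration
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 2.0, 0.2])
X = X - X.mean(axis=0)

S = np.cov(X, rowvar=False)
eigvals, vecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
V = vecs[:, order[:2]]                 # first two eigenvectors as columns

# orthonormal columns: V'V = I, so V (V'V)^{-1} V' is just V V'
P = V @ V.T
print(np.allclose(P, V @ np.linalg.inv(V.T @ V) @ V.T))   # True

X_plane = X @ P          # projected points, still in the original 3-D coordinates
X_reduced = X @ V        # the same points written in 2-D eigenvector coordinates
print(X_plane.shape, X_reduced.shape)   # (500, 3) (500, 2)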

Best Answer

You might find the following Python simulation of correlated data helpful for getting some insight into the relationship between the covariance matrix and its eigenvectors. It is based on a $2 \times 2$ covariance matrix, but it should help in working through your arguments. The image below was generated by the simulation code. I hope this helps.

[Figure: scatter plot of the simulated correlated data with the direction of the principal eigenvector overplotted]

# Simulate some correlated data and compare to eigenvectors.
#
import random as rn
rn.seed(1234)
alpha=0.3 # coupling factor between x1 and x2
sigma1 = 5 # standard deviation of x1
sigma2 = 1 # standard deviation of x2
nTrials=100000

x1_list=[]
x2_list=[]

# Generate sequence of correlated data
for iTrial in range(nTrials):
    x1 = rn.normalvariate(0,sigma1)
    x2 = rn.normalvariate(0,sigma2)+alpha*x1
    x1_list.append(x1)
    x2_list.append(x2)

# Plot results
import numpy as np
import math
import matplotlib.pyplot as plt

#correlation coefficient
# theory
print('corr coeff: theory = ', alpha*sigma1/math.sqrt(alpha**2*sigma1**2+sigma2**2))
# calculated:
print('corr coeff: calc = ', np.corrcoef(x1_list,x2_list))
# covariance:
myCov = np.cov(x1_list,x2_list)
print('covariance = ', myCov)

axis_range=30
plt.xlim(-axis_range, axis_range)
plt.ylim(-axis_range, axis_range)
plt.gca().set_aspect('equal', adjustable='box')
plt.plot(x1_list,x2_list,'.')

#
# determine eigenvalues and eigenvectors of the covariance matrix
#

C11=myCov[0,0]
C12=myCov[0,1]
C22=myCov[1,1]

print('----')
print('C11 = ', C11)
print('C12 = ', C12)
print('C22 = ', C22)

# Calculate eigenvalues

b = -(C11+C22)
c = C11*C22-C12*C12

lamb1 = (-b+math.sqrt(b*b-4*c))/2
lamb2 = (-b-math.sqrt(b*b-4*c))/2

print('----')
print('lamb1 = ', lamb1)
print('lamb2 = ', lamb2)

#
# Determine eigenvector for a matrix of the form:
# 
# a b
# c d
#
# which is for the (symmetric) covariance matrix with assignments:
#
# a=C11 b=C12
# c=C12 d=C22
#
# yielding eigenvector components:
#
# x1=-b/(a-lambda)
# x2=1 
#
ev1 = -C12/(C11-lamb1)
ev2 = -C12/(C11-lamb2)
print('----')
print('ev1 = ', ev1)
print('ev2 = ', ev2)
#
# overplot eigenvector associated with largest eigenvalue
#
# recall that our eigenvector column components are:
#  
#   x1
#   x2
#
# and we plot the x1 (first component) on the x-axis 
# and x2 (second component) on the y-axis in the figures. 
#

x2_vec = np.array([0.0,25.0])
x1_vec = ev1*x2_vec

plt.plot(x1_vec,x2_vec,'-')
plt.legend(['simulated data','principal eigenvector direction'])
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
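
As a cross-check on the hand-derived eigenvalues and eigenvector above, you can compare them with NumPy's symmetric eigen-solver; the few lines below assume they are run right after the code above, so that myCov and the numpy import are still in scope:

# cross-check the quadratic-formula results against numpy's eigen-solver
w, vecs = np.linalg.eigh(myCov)          # eigenvalues in ascending order
print('eigh eigenvalues = ', w[::-1])    # should match lamb1, lamb2
principal = vecs[:, np.argmax(w)]        # eigenvector of the largest eigenvalue
print('eigh slope x2/x1 = ', principal[1]/principal[0])   # should match 1/ev1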