Robust PCA is a very active research area, and identifying and removing outliers in a sound way is quite delicate. (I've written two papers in this field, so I do know a bit about it.) While I don't know SPSS, you may be able to implement the relatively simple Algorithm (1) here.
This algorithm (not mine) has rigorous guarantees but requires only some basic computations and a "while" loop. Assuming you are searching for $d$ principal components, the basic procedure is
- Compute PCA on your data,
- Project your data on to the top $d$ principal components,
- Throw away "at random" one of the data points whose projection is "too large", and
- Repeat this "a few" times.
Everything in quotation marks is a heuristic; you can find the details in the paper.
The idea behind this procedure is that vectors whose projection after PCA is large may have affected the estimate too much, and so you may want to throw them away. It turns out that choosing the ones to throw away "at random" is actually a reasonable thing to do.
If anyone actually wants to take the time to write the SPSS code for this, I'm sure @cathy would appreciate it.
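In the meantime, here is a minimal Python/numpy sketch of the loop described above (not SPSS, and not the paper's exact algorithm): the function name, the "too large" rule (a multiple of the median projection norm) and the repetition count are illustrative placeholders.

```python
import numpy as np

def remove_large_projections(X, d, threshold=3.0, max_iter=20, seed=None):
    """Hypothetical helper: repeatedly fit PCA and randomly discard one point
    whose projection onto the top d components is "too large"."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    keep = np.arange(X.shape[0])                       # rows still retained
    for _ in range(max_iter):
        Xc = X[keep] - X[keep].mean(axis=0)            # center the retained data
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        scores = Xc @ Vt[:d].T                         # projection on the top d PCs
        norms = np.linalg.norm(scores, axis=1)
        too_large = np.flatnonzero(norms > threshold * np.median(norms))
        if too_large.size == 0:                        # nothing left to discard
            break
        keep = np.delete(keep, rng.choice(too_large))  # drop one offender at random
    return Vt[:d], keep                                # directions and kept row indices
```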
Loadings (which should not be confused with eigenvectors) have the following properties:
- Their sums of squares within each component are the eigenvalues (the components' variances).
- Loadings are coefficients in the linear combination predicting a variable by the (standardized) components.
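As a quick numeric illustration of these two properties, here is a small numpy sketch on arbitrary random data (correlation-based PCA with all 4 components retained; the data and seed are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)           # standardized variables
R = np.corrcoef(Z, rowvar=False)                   # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                  # sort PCs by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

A = eigvecs * np.sqrt(eigvals)                     # loadings = eigenvectors scaled by sqrt(eigenvalues)
print(np.allclose((A ** 2).sum(axis=0), eigvals))  # point 1: column sums of squares = eigenvalues

C = (Z @ eigvecs) / np.sqrt(eigvals)               # standardized component scores
print(np.allclose(Z, C @ A.T))                     # point 2: variables are reproduced from the standardized components via the loadings
```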
You extracted the first 2 PCs out of 4. The matrix of loadings $\bf A$ and the eigenvalues:
A (loadings)
PC1 PC2
X1 .5000000000 .5000000000
X2 .5000000000 .5000000000
X3 .5000000000 -.5000000000
X4 .5000000000 -.5000000000
Eigenvalues:
1.0000000000 1.0000000000
In this instance both eigenvalues are equal. That is a rare case in the real world; it means that PC1 and PC2 are of equal explanatory "strength".
Suppose you also computed the component scores, an Nx2 matrix $\bf C$, and z-standardized them (mean = 0, st. dev. = 1) within each column. Then (as point 2 above says) $\bf \hat {X}=CA'$. But because you kept only 2 of the 4 PCs (you lack 2 more columns in $\bf A$), the restored data values $\bf \hat {X}$ are not exact: there is an error (unless eigenvalues 3 and 4 are zero).
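Continuing the numpy sketch above, keeping only the first 2 of the 4 components makes the reconstruction approximate, and the average squared error equals the discarded eigenvalues' share of the total variance:

```python
A2 = A[:, :2]                          # truncated 4x2 loading matrix
C2 = C[:, :2]                          # standardized scores of the first 2 PCs only
Z_hat = C2 @ A2.T                      # restored values, no longer exact
print(((Z - Z_hat) ** 2).mean())       # reconstruction error ...
print(eigvals[2:].sum() / 4)           # ... equals the unexplained share of the total variance
```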
OK. What are the coefficients to predict the components by the variables? Clearly, if $\bf A$ were the full 4x4 loading matrix, these would be $\bf B=(A^{-1})'$. With a non-square loading matrix we may compute them as $\bf B= A \cdot diag(eigenvalues)^{-1}=(A^+)'$, where diag(eigenvalues) is the square diagonal matrix with the eigenvalues on its diagonal, and the + superscript denotes the pseudoinverse. In your case:
diag(eigenvalues):
1 0
0 1
B (coefficients to predict components by original variables):
PC1 PC2
X1 .5000000000 .5000000000
X2 .5000000000 .5000000000
X3 .5000000000 -.5000000000
X4 .5000000000 -.5000000000
So, if $\bf X$ is the Nx4 matrix of the original centered variables (or standardized variables, if you are doing PCA based on correlations rather than covariances), then $\bf C=XB$; $\bf C$ are the standardized principal component scores, which in your example are:
PC1 = 0.5*X1 + 0.5*X2 + 0.5*X3 + 0.5*X4 ~ (X1+X2+X3+X4)/4
"the first component is proportional to the average score"
PC2 = 0.5*X1 + 0.5*X2 - 0.5*X3 - 0.5*X4 = (0.5*X1 + 0.5*X2) - (0.5*X3 + 0.5*X4)
"the second component measures the difference between the first pair of scores and the second pair of scores"
In this example it turned out that $\bf B=A$, but in the general case they are different.
Note: The above formula for the coefficients to compute component scores, $\bf B= A \cdot diag(eigenvalues)^{-1}$, is equivalent to $\bf B=R^{-1}A$, with $\bf R$ being the covariance (or correlation) matrix of the variables. The latter formula comes directly from linear regression theory. The two formulas are equivalent within the PCA context only. In factor analysis they are not, and to compute factor scores (which are always approximate in FA) one should rely on the second formula.
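As a quick numpy check of this note (arbitrary random data, correlation-based PCA, full 4x4 $\bf A$): both formulas for $\bf B$ agree, and the resulting scores come out standardized.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized variables
R = np.corrcoef(Z, rowvar=False)             # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
A = eigvecs * np.sqrt(eigvals)               # full 4x4 loading matrix

B1 = A / eigvals                             # A * diag(eigenvalues)^-1 (column-wise division)
B2 = np.linalg.solve(R, A)                   # R^-1 A
print(np.allclose(B1, B2))                   # True: the two formulas agree within PCA

C = Z @ B1                                   # component scores
print(np.allclose(C.std(axis=0), 1.0))       # each score column has unit variance (standardized)
```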
Related answers of mine:
- More details about loadings vs. eigenvectors.
- How principal component scores and factor scores are computed.
One of the reasons is that PCA can be thought of as a low-rank decomposition of the data that minimizes the sum of squared $L_2$ norms of the residuals of the decomposition. That is, if $Y$ is your data ($m$ vectors of $n$ dimensions) and $X$ is the PCA basis ($k$ vectors of $n$ dimensions), then the decomposition minimizes $$\lVert Y-XA \rVert^2_F = \sum_{j=1}^{m} \lVert Y_j - X A_{\cdot j} \rVert^2,$$ where $A$ is the matrix of coefficients of the PCA decomposition, $Y_j$ and $A_{\cdot j}$ are the $j$-th columns of $Y$ and $A$, and $\lVert \cdot \rVert_F$ is the Frobenius norm of a matrix.
Because PCA minimizes these squared $L_2$ norms, it has the same issue as least squares or fitting a Gaussian: it is sensitive to outliers. Because the deviations of the outliers get squared, they dominate the total norm and therefore drive the PCA components.
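As a small made-up illustration, a single extreme point is enough to swing the leading principal direction toward it:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=200)
Y = np.column_stack([t, 0.1 * rng.normal(size=200)])  # variation lies almost entirely along the first axis
Y_out = np.vstack([Y, [0.0, 50.0]])                    # add one extreme outlier on the second axis

def first_pc(data):
    data = data - data.mean(axis=0)                    # center
    _, _, Vt = np.linalg.svd(data, full_matrices=False)
    return Vt[0]                                       # leading principal direction

print(first_pc(Y))       # close to (+-1, 0): the true direction of variation
print(first_pc(Y_out))   # close to (0, +-1): dominated by the single outlier
```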