Solved – How to identify outliers and conduct robust PCA

outliers, pca, robust, spss

I want to conduct a principal component analysis (PCA) in SPSS. One assumption for PCA is that there are no significant outliers. How can I identify outliers in SPSS?

Best Answer

Robust PCA is a very active research area, and identifying and removing outliers in a sound way is quite delicate. (I've written two papers in this field, so I do know a bit about it.) While I don't know SPSS, you may be able to implement the relatively simple Algorithm (1) here.

This algorithm (not mine) has rigorous guarantees but requires only some basic computations and a "while" loop. Assuming you are searching for $d$ principal components, the basic procedure is:

  1. Compute PCA on your data,
  2. Project your data onto the top $d$ principal components,
  3. Throw away "at random" one of the data points whose projection is "too large", and
  4. Repeat this "a few" times.

Everything in quotation marks is a heuristic; you can find the details in the paper.

The idea behind this procedure is that vectors whose projection onto the principal components is large may have affected the estimate too much, so you may want to throw them away. It turns out that choosing which ones to throw away "at random" is actually a reasonable thing to do.
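Since I can't offer SPSS syntax, here is a rough Python/NumPy sketch of the loop described above, in case it helps someone port it. It is not the paper's exact algorithm: the function name `trimmed_pca` and the parameters `n_remove` and `seed` are made up for illustration, and I am assuming that "too large" and "at random" can be approximated by dropping one point per iteration with probability proportional to its squared projection onto the top $d$ components. Check the paper for the actual removal rule and stopping criterion.

```python
import numpy as np


def trimmed_pca(X, d, n_remove, seed=None):
    """Sketch of an iterative PCA that randomly discards high-projection points.

    X        : (n_samples, n_features) data matrix
    d        : number of principal components to estimate
    n_remove : how many points to discard in total (the "a few" times)
    seed     : seed for the random number generator (hypothetical parameter)

    Returns the (d, n_features) component matrix and the indices of kept rows.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    keep = np.arange(X.shape[0])

    for _ in range(n_remove):
        if len(keep) <= d:
            break  # too few points left to estimate d components

        # Step 1: PCA on the current points (SVD of the centered data).
        Xc = X[keep] - X[keep].mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)

        # Step 2: squared length of each point's projection onto the top d PCs.
        scores = np.sum((Xc @ Vt[:d].T) ** 2, axis=1)
        total = scores.sum()
        if total == 0:
            break  # all remaining points coincide; nothing to remove

        # Step 3 (assumed rule): drop one point at random, with probability
        # proportional to its squared projection.
        drop = rng.choice(len(keep), p=scores / total)
        keep = np.delete(keep, drop)

    # Final PCA on the points that survived.
    Xc = X[keep] - X[keep].mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d], keep
```

Calling `components, keep = trimmed_pca(X, d=2, n_remove=10)` returns the estimated components and the row indices that were retained, so the rows missing from `keep` are the candidate outliers. How large `n_remove` should be depends on how much contamination you expect, which is a judgment call this sketch does not settle.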

If anyone actually wants to take the time to write the SPSS code for this, I'm sure @cathy would appreciate it.