Solved – How to use PCA to detect outliers

anomaly detection, dimensionality reduction, outliers, pca

PCA reduces the dimensionality of the original data by constructing a subspace spanned by eigenvectors, each of which captures the (next) highest share of the variance in the data.

Let's start from this subspace: PCA has been performed, and we now look at the resulting subspace:

[Figure: PCA]

Now let's assume there are outliers (wherever exactly they are). How can they be detected?

So far, I know there are two methods:

  • Track the angle(s?) between the PCs

  • Check the number of PCs

I think neither is robust, because new or additional data will probably change the angles without any outlier being present. The number of axes makes more sense, but I can still imagine situations where new data introduces a new axis without all of that data being outliers.

I thought of using a distance or a defined radius to scan for new outliers, but I can hardly find any such approaches. The question "Why is PCA sensitive to outliers?" explains why PCA is sensitive to outliers; this sensitivity can probably be used as a tool as well.

In other words: how exactly is PCA used to detect outliers, or rather, how are outliers detected after PCA has been performed?

Best Answer

One approach is to treat as outliers those points that cannot be reconstructed well from the principal vectors you have selected.

The procedure goes like this:

1. Fix two positive numbers, a and b (see the next steps for their meaning and for how to select them; they can be refined using cross-validation).

2. Compute the PCA.

3. Keep the principal vectors associated with principal values greater than a, say $v_1, v_2, \dots, v_k$ (these are orthonormal vectors).

4. For each data point, compute the reconstruction error using the principal vectors from step 3. For a data point $x$, the reconstruction error is $e = \|x - \sum_{i=1}^{k} w_i v_i\|_2$, where $w_i = v_i^T x$.

5. Output as outliers those data points whose reconstruction error is greater than b.
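
The five steps can be sketched in NumPy roughly as follows. This is a minimal illustration, not a reference implementation: the function name `pca_reconstruction_outliers` is made up for this example, the "principal values" are taken to be eigenvalues of the sample covariance matrix, and each point is centered by the data mean before projection (the answer does not state this explicitly, but PCA is normally computed on centered data). The input is assumed to have at least two features.

```python
import numpy as np

def pca_reconstruction_outliers(X, a, b):
    """Flag rows of X whose PCA reconstruction error exceeds b.

    X: array of shape (n_samples, n_features), n_features >= 2
    a: eigenvalue threshold for keeping a principal vector (step 3)
    b: reconstruction-error threshold for calling a point an outlier (step 5)
    """
    X = np.asarray(X, dtype=float)
    # Assumption: center the data, so x in the formula is relative to the mean.
    Xc = X - X.mean(axis=0)

    # Step 2: PCA via eigen-decomposition of the sample covariance matrix.
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors are orthonormal columns

    # Step 3: keep principal vectors whose principal value is greater than a.
    V = eigvecs[:, eigvals > a]             # shape (n_features, k)

    # Step 4: w_i = v_i^T x for every point, then e = ||x - sum_i w_i v_i||_2.
    W = Xc @ V                              # shape (n_samples, k)
    errors = np.linalg.norm(Xc - W @ V.T, axis=1)

    # Step 5: points with error greater than b are reported as outliers.
    return errors > b, errors
```

Calling `mask, errors = pca_reconstruction_outliers(X, a, b)` returns a boolean mask and the per-point errors; inspecting the distribution of `errors` is one possible way to pick b before refining it by cross-validation as the answer suggests.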

Update: The procedure captures only "direction" outliers. Additionally, before the first step, a "norm" outlier detection step can be included. This consists of computing the norms of the data points and labeling as outliers those whose norm is too small or too large.
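
A sketch of such a "norm" pre-filter might look like this. The function name `norm_outliers` and the thresholds `low` and `high` are hypothetical; the answer does not say how to choose them, nor whether the points should be centered first, so centering is left as an optional switch that is off by default.

```python
import numpy as np

def norm_outliers(X, low, high, center=False):
    """Flag rows of X whose Euclidean norm is below low or above high."""
    X = np.asarray(X, dtype=float)
    if center:
        # Optional assumption: measure norms relative to the data mean.
        X = X - X.mean(axis=0)
    norms = np.linalg.norm(X, axis=1)
    # A point is a "norm" outlier if its norm is too small or too large.
    return (norms < low) | (norms > high), norms
```

Points flagged here could be removed before running the PCA, so that they do not distort the principal vectors used in the reconstruction step.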

It depends on what an outlier is in your context.
