Anomaly detection using PCA reconstruction error

Tags: anomaly detection, outliers, pca

I would like to use PCA for anomaly detection, but I'm not sure exactly how this is done (I'm using prcomp in R).

I'm really questioning the approach, not the R code itself.
Am I right in thinking that I first run PCA on a set of data to find a lower-dimensional subspace spanned by the first $k$ PCs? Then, as new data becomes available, I reconstruct it from those $k$ PCs and examine the reconstruction error. If the error blows up, I know the new sample doesn't have the same 'structure' as the data used to build the PCs, and is therefore different in some way, i.e. an anomaly.

Can someone tell me if I'm in the right ballpark with my assumption?

Best Answer

Yes, you can do this. The method measures the squared Euclidean distance between a new point and its projection onto the subspace found by PCA. It gives large values for outliers that lie away from that subspace, in directions orthogonal to the retained principal axes (point 1 in the example below), but it is insensitive to outliers along those axes (point 2). Insensitivity to this second kind of outlier may be desirable or undesirable, depending on your application. The reconstruction error is continuous-valued, so you also need a way to choose the threshold above which a point counts as an outlier/anomaly.
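
To make the distance explicit, here is one way to write it down. The notation is my own assumption, not taken from the question: $\mu$ is the mean of the training data, $W_k$ is the $p \times k$ matrix whose orthonormal columns are the first $k$ principal axes, and $x$ is a new point. The reconstruction and its error are then

$$
\hat{x} = \mu + W_k W_k^\top (x - \mu), \qquad
e(x) = \lVert x - \hat{x} \rVert^2 = \lVert (I - W_k W_k^\top)(x - \mu) \rVert^2 .
$$

Because $I - W_k W_k^\top$ projects onto the discarded directions, any displacement of $x$ that lies inside the span of $W_k$ contributes nothing to $e(x)$, however large it is. That is why an outlier along the retained axes goes unnoticed.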

[Figure: example showing point 1, an outlier orthogonal to the principal axes, and point 2, an outlier along them]
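
As a rough illustration of the whole procedure, here is a minimal sketch in R using prcomp. Everything specific in it is an assumption made for the example: the data is synthetic, $k = 2$ is picked by hand, recon_error is a hypothetical helper (not part of any package), and the 99% quantile of the training errors is just one possible way to set the threshold.

```r
# Minimal sketch: PCA reconstruction error as an anomaly score.
# Synthetic data; k and the 99% cutoff are assumptions for illustration.
set.seed(1)

# Training data: 3 variables lying close to a 2-D plane (x3 ~ x1 + x2).
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
train <- cbind(x1 = x1, x2 = x2, x3 = x1 + x2 + rnorm(n, sd = 0.1))

pca <- prcomp(train, center = TRUE, scale. = FALSE)
k   <- 2  # number of PCs kept (assumed)

# Hypothetical helper: squared reconstruction error of each row of `newdata`
# using the first k PCs. Assumes the PCA was fitted with scale. = FALSE.
recon_error <- function(pca, newdata, k) {
  newdata  <- as.matrix(newdata)
  centered <- sweep(newdata, 2, pca$center)       # subtract training means
  W        <- pca$rotation[, 1:k, drop = FALSE]   # first k principal axes
  residual <- centered - centered %*% W %*% t(W)  # part outside the subspace
  rowSums(residual^2)
}

# Threshold taken from the training errors (assumed rule; tune as needed).
train_err <- recon_error(pca, train, k)
threshold <- unname(quantile(train_err, 0.99))

new_points <- rbind(
  typical   = c(0.5, -0.3, 0.2),  # ordinary point on the plane
  off_plane = c(0.5, -0.3, 5.0),  # outlier orthogonal to the subspace
  in_plane  = c(4.0,  4.0, 8.0)   # outlier along the principal axes
)
new_err <- recon_error(pca, new_points, k)
flagged <- new_err > threshold

print(new_err)
print(flagged)
```

In this sketch only off_plane should exceed the threshold. The in_plane point is far from the bulk of the training data but stays within the fitted subspace, so its reconstruction error remains small, which is the insensitivity described above. If the PCA is fitted with scale. = TRUE, new data would also need to be divided by pca$scale before projecting.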
