This has, of course, been tried before.
In fact, there is an implementation in ELKI, based on k-means.
The problem is, this isn't as easy as it sounds.
- k-means itself is sensitive to outliers.
- the points in between clusters aren't necessarily outliers, but may still receive a high score
- the points away from all clusters can be more easily found by computing the distance from the data set center.
So in the end, it simply doesn't work.
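To make these points concrete, here is a minimal sketch of the distance-to-nearest-centroid score, using scikit-learn's KMeans on synthetic data; the cluster layout and the two probe points are illustrative assumptions of mine, not anything from the ELKI implementation:

```python
# Sketch of k-means-based outlier scoring and its failure modes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
between = np.array([[2.5, 2.5]])     # sits between the clusters
far_away = np.array([[12.0, 12.0]])  # far from everything
X = np.vstack([cluster_a, cluster_b, between, far_away])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Outlier score = distance to the nearest cluster centroid.
score = km.transform(X).min(axis=1)

# The in-between point receives a high score despite not being an
# outlier in any interesting sense ...
print("between-clusters score:", score[-2])
# ... while the far-away point is found just as well by the trivial
# distance from the data set center:
center_score = np.linalg.norm(X - X.mean(axis=0), axis=1)
print("far-away k-means score:", score[-1])
print("far-away center score: ", center_score[-1])
```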
DBSCAN is a clustering algorithm that has a concept of "noise": objects in regions of low density. These are, in effect, primitive outliers.
There are many more (and more advanced) outlier detection methods available in ELKI; these worked much better for me than the k-means-based approach.
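As a small illustration of the noise concept (using scikit-learn's DBSCAN rather than ELKI; the data and the eps/min_samples values are illustrative guesses):

```python
# DBSCAN labels points in low-density regions as noise.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
dense = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(200, 2))
scattered = rng.uniform(low=-4.0, high=4.0, size=(10, 2))
X = np.vstack([dense, scattered])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
# Points in low-density regions get the label -1 ("noise"); these
# are the primitive outliers described above.
print("noise points found:", (labels == -1).sum())
```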
> I want to calculate the eigenvectors from the reference image and use
> these eigenvectors to reconstruct the experiment image. The theory
> being then that the difference between the real experiment image and
> the experiment image reconstructed using the reference eigenvectors
> should highlight the feature we're interested in.
This is not likely to be as straightforward as you may be hoping, unless your new feature happens to have zero covariance with the training data. That is the only way the new feature can leave the PC scores undistorted, since the PC scores are the covariance between the data and the PC eigenvectors. However, you can iteratively Winsorise the most extreme differences between the new data and its reconstruction, to reduce how much of the new feature's covariance propagates into the model PC scores.
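To sketch what this iterative Winsorisation might look like in code: the 95% clip level, the fixed iteration count, and the function and variable names below are all my own assumptions, not a prescribed recipe.

```python
# One possible reading of the iterative Winsorisation: clip the most
# extreme residuals between the new image and its reconstruction,
# then refit the scores on the clipped data.
import numpy as np

def reconstruct(x, mean, eigvecs):
    """Project x onto the reference eigenvectors and back.
    eigvecs holds one unit-length eigenvector per column."""
    scores = (x - mean) @ eigvecs       # PC scores
    return mean + scores @ eigvecs.T    # reconstruction

def winsorised_reconstruct(x, mean, eigvecs, n_iter=10, q=0.95):
    x_work = x.copy()
    for _ in range(n_iter):
        recon = reconstruct(x_work, mean, eigvecs)
        resid = x_work - recon
        cutoff = np.quantile(np.abs(resid), q)
        # Winsorise: pull the largest residuals back to the cutoff,
        # limiting how much the new feature leaks into the scores.
        x_work = recon + np.clip(resid, -cutoff, cutoff)
    return reconstruct(x_work, mean, eigvecs)
```

The difference between the original experiment image and `winsorised_reconstruct(...)` would then be the candidate for the new feature.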
The larger the proportion of the new data that consists of the new feature, the harder this becomes: you end up Winsorising most of the new data and have little left to produce a reliable fit.
Also, the more the new feature correlates with the old data, the harder it will be to isolate cleanly. If the new feature is like the old data except that one feature moves 3 pixels, then PCA will see it as almost identical, with a tiny residual, but other interpretation methods (e.g. database matching, feature detection) may see it as a completely different entity with radically different causal and scientific implications.
> Can I simply drop in my "clean" eigenvectors or do I need to rescale
> them (or the PC scores)? Logically it would seem some rescaling would
> be needed somewhere but I'm not 100% clear on how.
The eigenvectors are unit vectors, so they are scale-free. This means that in principle no scaling is required before reconstruction, other than applying the same pre-treatment that was used on the training data. If scaling was used as a pre-treatment, then the same mean and standard deviation values from the training set are used to pre-process the new data; they are not recalculated from the new data.
However, note that after reconstruction a different kind of rescaling may be needed. When reconstructing new data with an old model, you can no longer assume that the covariance scales are still meaningful. For example, if working with images, the original is constrained to [0, 255], but there is no such constraint on the reconstruction, since the new feature is not constrained to fit the covariance structure of the old data. In such cases, the final reconstruction can be rescaled to bring it back into a usable range.
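A minimal sketch of this pre- and post-processing, assuming `mu`, `sigma`, and `eigvecs` (unit eigenvectors as columns) were computed once from the reference images and are reused, never recalculated from the experiment image:

```python
import numpy as np

def reconstruct_new(x_new, mu, sigma, eigvecs):
    z = (x_new - mu) / sigma    # same pre-treatment as the training data
    scores = z @ eigvecs        # eigenvectors are unit vectors,
    recon = scores @ eigvecs.T  # so no extra scaling is needed here
    recon = recon * sigma + mu  # undo the pre-treatment
    # Post-reconstruction, values may fall outside the valid image
    # range, so rescale back into [0, 255] for display.
    lo, hi = recon.min(), recon.max()
    return (recon - lo) / (hi - lo) * 255.0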
Best Answer
Yes, you can do this. This method measures the squared Euclidean distance between a new point and its projection onto the subspace found by PCA. It will give large values for outliers along directions orthogonal to the principal axes (point 1 in the example below), but not for outliers along them (point 2). Insensitivity to this second kind of outlier may be desirable or undesirable, depending on your application. The reconstruction error gives continuous values, so you'd need a way to choose the threshold for what counts as an outlier/anomaly.
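A minimal sketch of this score, assuming scikit-learn's PCA; the training data here is synthetic, the number of components is an arbitrary choice, and the threshold is deliberately left open, as in the answer:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 10))
pca = PCA(n_components=3).fit(X_train)

def reconstruction_error(X):
    """Squared Euclidean distance from each point to its
    projection onto the PCA subspace."""
    recon = pca.inverse_transform(pca.transform(X))
    return np.sum((X - recon) ** 2, axis=1)

scores = reconstruction_error(X_train)
# High scores: points off the principal subspace (like point 1).
# Low scores: points far out *along* the principal axes (like
# point 2), which reconstruct almost perfectly.
```

A threshold on these scores then defines what counts as an anomaly.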