Solved – PCA – Reconstruction from a “clean” set of eigenvectors

Tags: numpy, pca, python, svd

This question relates to the explanation of how to reconstruct data from principal components given here:

How to reverse PCA and reconstruct original variables from several principal components?

I have two datasets (spectral imagery data) that are similar, but one should contain a feature that the other doesn't (i.e. I have a "clean" image as a reference and an "experiment" image containing a feature not present in the reference).

I want to calculate the eigenvectors from the reference image and use these eigenvectors to reconstruct the experiment image. The theory is that the difference between the real experiment image and the experiment image reconstructed from the reference eigenvectors should then highlight the feature we're interested in.

The thing I don't understand is how this works in practice.

Using this equation from How to reverse PCA and reconstruct original variables from several principal components?:

$$\boxed{\text{PCA reconstruction} = \text{PC scores} \cdot \text{Eigenvectors}^\top + \text{Mean}}$$
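For concreteness, here is a minimal numpy sketch of that equation applied to this setup, with the eigenvectors and mean taken from the reference image. The array names, shapes and the choice of `k` are illustrative assumptions, not part of the original question:

```python
import numpy as np

# Hypothetical stand-ins for the two spectral images, flattened to
# (n_samples, n_features); the real data would be loaded instead.
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(500, 20))   # reference ("clean") image
X_exp = rng.normal(size=(500, 20))   # experiment image with the extra feature

k = 5                                # number of components to keep (assumed)

# Eigenvectors of the reference covariance, via SVD of the centred reference data
mean_ref = X_ref.mean(axis=0)
U, s, Vt = np.linalg.svd(X_ref - mean_ref, full_matrices=False)
V_k = Vt[:k].T                       # (n_features, k) "clean" eigenvectors

# PC scores of the experiment data in the reference basis
scores = (X_exp - mean_ref) @ V_k

# PCA reconstruction = PC scores . eigenvectors^T + mean
X_exp_hat = scores @ V_k.T + mean_ref

# Residual that should highlight the feature absent from the reference
residual = X_exp - X_exp_hat
```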

Can I simply drop in my "clean" eigenvectors or do I need to rescale them (or the PC scores)? Logically it would seem some rescaling would be needed somewhere but I'm not 100% clear on how.

Best Answer

I want to calculate the eigenvectors from the reference image and use these eigenvectors to reconstruct the experiment image. The theory is that the difference between the real experiment image and the experiment image reconstructed from the reference eigenvectors should then highlight the feature we're interested in.

This is not likely to be as straightforward as you may be hoping, unless your new feature happens to have zero covariance with the training data. That is the only way the new feature can leave the PC scores undistorted, since the PC scores are the projections of the (mean-centred) data onto the PC eigenvectors. However, you can iteratively Winsorise the most extreme differences between the new data and its reconstruction to reduce how much of the new feature's covariance propagates into the model PC scores.
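A minimal sketch of what such an iterative Winsorising loop could look like, assuming numpy and the variables from the sketch above; the MAD-based clipping threshold and the iteration count are my own assumptions, not part of the answer:

```python
import numpy as np

def winsorised_reconstruction(X_new, V_k, mean_ref, n_iter=5, n_mads=3.0):
    """Iteratively clamp the most extreme residuals before re-projecting,
    so the new feature contaminates the PC scores less on each pass."""
    X_work = X_new.copy()
    for _ in range(n_iter):
        scores = (X_work - mean_ref) @ V_k
        X_hat = scores @ V_k.T + mean_ref
        resid = X_work - X_hat

        # Winsorise: clip residuals beyond n_mads robust standard deviations
        mad = np.median(np.abs(resid - np.median(resid)))
        limit = n_mads * 1.4826 * mad
        X_work = X_hat + np.clip(resid, -limit, limit)

    # Final reconstruction from the (partially) Winsorised data
    scores = (X_work - mean_ref) @ V_k
    return scores @ V_k.T + mean_ref
```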

The larger the proportion of the new data that is made up of the new feature, the harder this will be, because you end up Winsorising most of the new data and have little left with which to produce a reliable fit.

Also, the more the new feature correlates with the old data, the harder it will be to isolate cleanly. If the new feature looks like the old data but is shifted by, say, 3 pixels, PCA will see it as almost identical, with only a tiny residual, while other interpretation methods (e.g. database matching, feature detection) may see it as a completely different entity with radically different causal and scientific implications.

Can I simply drop in my "clean" eigenvectors or do I need to rescale them (or the PC scores)? Logically it would seem some rescaling would be needed somewhere but I'm not 100% clear on how.

The eigenvectors are unit vectors, so they are scale-free. This means that, in principle, no scaling is required before reconstruction other than applying the same pre-treatment that was used on the training data. If scaling was used as a pre-treatment, then the same mean and standard deviation values computed from the training set are used to pre-process the new data - they are not recalculated from the new data.
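As a small illustration of that point, assuming standardisation was the pre-treatment (the variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X_ref = rng.normal(size=(500, 20))   # training / reference data
X_exp = rng.normal(size=(500, 20))   # new / experiment data

# Statistics come from the reference data only ...
mu_train = X_ref.mean(axis=0)
sd_train = X_ref.std(axis=0, ddof=1)

# ... and are re-used verbatim on the new data; never recomputed from X_exp
Z_exp = (X_exp - mu_train) / sd_train
```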

However, note that a different kind of rescaling may be needed after reconstruction. When reconstructing the new data with an old model, you can no longer assume that the covariance scales are still meaningful. For example, when working with images, the original is constrained to [0, 255], but there is no such constraint on the reconstruction, since the new feature is not constrained to fit the covariance structure of the old data. In such cases, the final reconstruction can be rescaled to bring it back into a usable range.
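One possible post-reconstruction rescale for 8-bit imagery is a simple min-max rescale, sketched below; this is just one option (clipping to [0, 255] is another):

```python
import numpy as np

def rescale_to_uint8(recon):
    """Linearly rescale a reconstruction back into the displayable [0, 255] range."""
    lo, hi = recon.min(), recon.max()
    if hi == lo:
        return np.zeros_like(recon, dtype=np.uint8)
    return np.round(255 * (recon - lo) / (hi - lo)).astype(np.uint8)
```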
