Solved – PCA with anomaly detection

anomaly detection, dimensionality reduction, feature selection, pca

I am developing an algorithm which should find anomalies in a dataset.

To reduce computation time I applied PCA to the data: reducing the number of features reduces the computation time.

When reviewing it with a colleague, a question came up about the impact of PCA in such a use case, with the following example:

I have a dataset with n samples and m features (m > 1). Suppose an anomaly is reflected in only one feature: a feature whose value is always 0, except in the anomalous case, where it is 1. Our concern is that PCA will neglect this feature, and when we reduce the number of columns after PCA (say we keep the components explaining 95% of the variance), the anomaly will "disappear".
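To illustrate the scenario, here is a minimal sketch (the toy data sizes and the use of scikit-learn are just for illustration, not my actual pipeline):

```python
# Toy sketch of the scenario described above (data sizes and the use of
# scikit-learn are assumptions for illustration only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 10))            # ten ordinary features
X = np.hstack([X, np.zeros((n, 1))])    # one feature that is always 0 ...
X[0, -1] = 1.0                          # ... except for a single anomalous 1

# Keep only the components needed to explain 95% of the variance.
pca = PCA(n_components=0.95).fit(X)
print("components kept:", pca.n_components_, "out of", X.shape[1])

# The retained components barely load on the anomalous feature, so the
# anomaly is effectively invisible in the reduced data.
print("largest |loading| on the anomalous feature:",
      np.abs(pca.components_[:, -1]).max())
```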

Is using PCA for finding anomalies discouraged, or are we missing something?

Note: this might look like the extreme case presented here: will-i-miss-anomalies-outliers-due-to-pca, but the answers there are not suitable for my case, and in any event, as far as I understand, their anomaly shouldn't be affected by the PCA (it appears on all scales).

Best Answer

PCA may be used to reduce your number of features, but it doesn't have to. You will have as many PCs as original features; it's just that some of them will account for very little of the total variability. That can be visualized in a scree or Pareto plot, where the accumulated variance reaches 100% with the last PC. Therefore, you should not be missing any information by using PCA (a short sketch after the two points below illustrates this). There is some discussion about this in "Do components of PCA really represent percentage of variance? Can they sum to more than 100%?". But then, two contradictory points emerge here:

1) If no reduction in dimensionality is achieved when retaining all PCs, that is, if you care about all the anomalies present in your dataset, while your original goal was to have fewer features to work with (which will make you lose some information), why use PCA?

2) PCA is generally used when the interest is in the "main modes of variability" of your dataset: generally the first couple of PCs. Small anomalies, as I believe is the case for the ones you pointed out, are expected to be ignored once only the main components are retained, i.e. when you keep only the first PCs for dimensionality reduction.
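To make this concrete, here is a minimal sketch (the toy data and the use of scikit-learn are assumptions, not part of the question): keeping all PCs loses nothing, since the explained-variance ratios sum to 1, whereas keeping only the first PCs flattens a rare single-feature anomaly like the one described in the question.

```python
# Assumed toy illustration of the two claims above.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 10))            # ten ordinary features
X = np.hstack([X, np.zeros((n, 1))])    # one feature that is always 0 ...
X[0, -1] = 1.0                          # ... except for a single anomalous 1

# Keeping every PC loses nothing: the explained-variance ratios sum to 1.
full = PCA().fit(X)
print(np.cumsum(full.explained_variance_ratio_)[-1])     # ~1.0

# Keeping only the "main modes" flattens the anomaly away.
reduced = PCA(n_components=5)
X_hat = reduced.inverse_transform(reduced.fit_transform(X))
print("original value:", X[0, -1], "after reduction:", X_hat[0, -1])
```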

Hope this helps.

EDIT: The "size" or frequency of the anomalies of one feature are not important by themselves, but you should compare them to the the others in order to know wether they will disappear when you reduce dimensionality. Say, if the variability of this specific anomaly is (quasi-)orthogonal to the first PC's (the ones you use), then you will lose this information. If you are lucky that the mode of variability of the anomalies you are interested in is similar to the main modes of the variability of your entire dataset, then this iformation is kept in the first PC's. There is a nice discussion about this matter here: https://stats.stackexchange.com/a/235107/144543
