Solved – anomaly detection for data with a clustering structure

anomaly detection · clustering · machine learning

One anomaly detection approach, covered in Andrew Ng's Coursera lecture, is to fit a multivariate Gaussian to the data to construct a probability density.

I previously thought that the probability density method is supervised learning, since we build a probability density and determine the threshold (if the density is smaller, the point is an outlier) based on the positive samples. But I read somewhere that anomaly detection can be unsupervised learning, training only on negative samples. Or am I totally wrong?

What if the data show a clustering structure (not a single chunk)? In that case, do we resort to unsupervised clustering to construct the density? If so, how? Are there other systematic ways to discover whether such a case exists?

Best Answer

As answered before (don't cross-post duplicates, please!)

You can just use a regular GMM and a manually chosen threshold on the likelihood to identify outliers: points that don't fit the model well are outliers. The threshold can be interpreted probabilistically, e.g. under a Gaussian, about 99.7% of points lie within 3 standard deviations of the mean.
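A minimal sketch of this idea, assuming scikit-learn is available and using synthetic two-cluster data with planted outliers (the 1% cutoff is an arbitrary illustrative choice, not a recommendation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: two Gaussian clusters plus two far-away planted outliers.
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
cluster_b = rng.normal(loc=8.0, scale=1.0, size=(200, 2))
outliers = np.array([[20.0, 20.0], [-15.0, 10.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

# Fit a GMM with one component per cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)  # per-point log-likelihood

# Manually chosen threshold: flag the lowest 1% of log-densities.
threshold = np.percentile(log_density, 1)
is_outlier = log_density < threshold
```

Points whose log-density falls below the threshold are reported as outliers; the planted points at (20, 20) and (-15, 10) land far below both mixture components and get flagged.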

This works okay as long as your data really is composed of Gaussians.

Furthermore, clustering is fairly expensive. It is usually faster to directly use a nonparametric outlier detector such as kNN, LOF, or LoOP. These are unsupervised.
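As an illustration of the nonparametric route, here is a hedged sketch using scikit-learn's LOF implementation on synthetic data (the `n_neighbors` and `contamination` values are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
# One dense blob plus one planted far-away point.
X = np.vstack([rng.normal(size=(300, 2)), [[8.0, 8.0]]])

# LOF compares each point's local density to that of its neighbors;
# no cluster model is fitted.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)  # -1 marks outliers, 1 marks inliers
```

Note that LOF scores points relative to their local neighborhood, so it copes with clusters of different densities without any explicit clustering step.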

There are also methods such as one-class SVMs that are meant to be trained on known outlier-free data, i.e., on the one "normal" class only. If you train such a model on data that contains outliers, it may learn those instances as "normal".
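A sketch of that usage pattern, assuming scikit-learn and assumed-clean synthetic training data (the `nu` value is an illustrative assumption):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
# Training set assumed to be outlier-free ("normal" class only).
X_train = rng.normal(size=(300, 2))
oc_svm = OneClassSVM(nu=0.05, gamma="scale").fit(X_train)

# At test time, score new points against the learned boundary.
X_test = np.vstack([rng.normal(size=(5, 2)), [[10.0, 10.0]]])
pred = oc_svm.predict(X_test)  # +1 inlier, -1 outlier
```

If `X_train` secretly contained points like (10, 10), the learned boundary could expand to include them, which is exactly the failure mode described above.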

Andrew Ng's Coursera lecture on this topic is extremely one-sided and ignores most of the existing work. You need to look at some surveys of outlier detection to get a wider, non-ML view.

https://scholar.google.com/scholar?hl=en&q=unsupervised+outlier+detection+survey
