This is my first post ;-)
K-means is not the ideal method for clustering outliers.
See the following example:
In k-means the number of clusters k is fixed in advance (i.e. it is a hyperparameter of the algorithm). You want the outliers to end up in a bin of their own, separate from the rest of the data, so your choice of k needs to be just right for the number of outliers in your data. In the left and middle examples k is chosen well and the outliers are captured nicely. The right example groups one outlier into a larger cluster because that cluster's centroid is the closest one to it.
To overcome this problem you may want to repeat k-means with different values of k and compare the total intra-cluster distance. This fair-enough approach is known as the Elbow Method; please see Wikipedia for more details on that.
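A minimal sketch of that idea, assuming scikit-learn's KMeans (whose inertia_ attribute is the total intra-cluster squared distance) and some made-up two-blob data with a couple of outliers:

```python
# Elbow Method sketch: run k-means for several k and record the
# total intra-cluster distance (sklearn calls it inertia_).
# The data below is made up purely for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),        # dense blob 1
    rng.normal(5, 0.5, size=(50, 2)),        # dense blob 2
    np.array([[10.0, 10.0], [11.0, 9.5]]),   # two far-away outliers
])

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias[k] = km.inertia_

for k in sorted(inertias):
    print(k, round(inertias[k], 1))
# Look for the "elbow": the k after which inertia stops dropping sharply.
```

The inertia always decreases as k grows, so you pick the k where the curve bends rather than the global minimum.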
A more elegant solution, however, is to use a hierarchical clustering algorithm such as Agglomerative Clustering. Hierarchical clustering algorithms are generally better at isolating outliers and aberrant data. Agglomerative Clustering starts with every data point as its own cluster and merges clusters that are close enough to each other.
In most implementations you can select both the distance metric (e.g. Euclidean) and the distance threshold as hyperparameters, for example in the sklearn package in Python:
from sklearn.cluster import AgglomerativeClustering

# distance_threshold requires n_clusters=None in sklearn; clusters
# further apart than the threshold are never merged
ac = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5)
labels = ac.fit_predict(data)
print(labels)
The sklearn implementation offers more hyperparameters to tailor the algorithm to your needs, but in my opinion distance_threshold is the most important one for your application.
Hope this helps.
Best,
Chris
Best Answer
Gaussian Mixture Models assign to each data point a probability of being generated by each of k Gaussian distributions.
These probabilities are normalized to sum to one, allowing the interpretation "Which cluster is most probably responsible for this data point?"
If you do not normalize, you get absolute likelihoods that estimate how probable a point is under the fitted Gaussian mixture model.
Then you can simply define an outlier criterion such as: if p < 0.05 for every cluster, the point is an outlier.
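One way to sketch this with scikit-learn's GaussianMixture: score_samples returns the (unnormalized) log-likelihood of each point under the fitted mixture, and points with a very low likelihood get flagged. The data and the percentile cutoff below are illustrative assumptions, standing in for the p < 0.05 rule:

```python
# GMM-based outlier detection sketch; data is made up for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0, 0.5, size=(100, 2)),   # dense blob 1
    rng.normal(6, 0.5, size=(100, 2)),   # dense blob 2
    np.array([[12.0, 12.0]]),            # an obvious outlier (index 200)
])

gm = GaussianMixture(n_components=2, random_state=0).fit(data)
log_p = gm.score_samples(data)   # absolute log-likelihood of each point

# Flag everything below a chosen likelihood threshold, here the
# 1st percentile of the training likelihoods.
threshold = np.percentile(log_p, 1)
outliers = np.where(log_p < threshold)[0]
print(outliers)
```

Note that predict_proba would give you the normalized per-cluster responsibilities instead; for outlier detection you want the absolute likelihoods, which is why score_samples is used here.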
Be warned, though: the expectation-maximization algorithm for Gaussian mixture models - which you will need to fit the parameters of your model - is not very robust and tends to find suboptimal solutions.
For reading more - and especially for understanding more - I recommend Bishop: Pattern Recognition and Machine Learning.