This is my first post ;-)
K-means is not the ideal method for clustering outliers.
See the following example:
In k-means the number of clusters k is fixed in advance (i.e. it is a hyperparameter of the algorithm). You want the outliers to end up in a bin of their own, separate from the rest of the data, so your choice of k needs to be just right for the number of outliers in your data. In the left and middle examples k is chosen well and the outliers are captured nicely. The right example groups one outlier into a larger cluster because that cluster's centroid is the closest one to it.
To overcome this problem you may want to repeat k-means with different values of k and compare the total intra-cluster distance. This fair-enough approach is known as the Elbow Method; please see Wikipedia for more details on that.
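A minimal sketch of that idea, assuming scikit-learn's KMeans (whose inertia_ attribute is the total intra-cluster squared distance) and some made-up two-blob data with a couple of outliers:

```python
# Elbow Method sketch: run k-means for several k and record the
# total intra-cluster distance (sklearn calls it inertia_).
# The data below is made up purely for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),        # dense blob 1
    rng.normal(5, 0.5, size=(50, 2)),        # dense blob 2
    np.array([[10.0, 10.0], [11.0, 9.5]]),   # two far-away outliers
])

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias[k] = km.inertia_

for k in sorted(inertias):
    print(k, round(inertias[k], 1))
# Look for the "elbow": the k after which inertia stops dropping sharply.
```

The inertia always decreases as k grows, so you pick the k where the curve bends rather than the global minimum.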
A more elegant solution, however, is to use a hierarchical clustering algorithm such as Agglomerative Clustering. Hierarchical clustering algorithms are generally better at isolating outliers and aberrant data. Agglomerative Clustering starts with every data point as its own cluster and merges clusters that are close enough to each other.
In most implementations you can select both the distance metric (e.g. Euclidean) and the distance threshold as hyperparameters, for example in the sklearn package in Python:
from sklearn.cluster import AgglomerativeClustering

# distance_threshold requires n_clusters=None in sklearn; clusters
# further apart than the threshold are never merged
ac = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5)
labels = ac.fit_predict(data)
print(labels)
The sklearn implementation offers more hyperparameters to tailor the algorithm to your needs, but in my opinion distance_threshold is the most important one for your application.
Hope this helps.
Best,
Chris
Best Answer
Gaussian Mixture Models assign to each data point a probability of being generated by each of k Gaussian distributions.
These probabilities are normalized to sum to one, allowing the interpretation "Which cluster is most probably responsible for this data point?"
If you do not normalize, you get absolute likelihoods that estimate how probable a point is under the fitted Gaussian mixture model.
Then you can simply define an outlier criterion such as: if p < 0.05 for every cluster, the point is an outlier.
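One way to sketch this with scikit-learn's GaussianMixture: score_samples returns the (unnormalized) log-likelihood of each point under the fitted mixture, and points with a very low likelihood get flagged. The data and the percentile cutoff below are illustrative assumptions, standing in for the p < 0.05 rule:

```python
# GMM-based outlier detection sketch; data is made up for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0, 0.5, size=(100, 2)),   # dense blob 1
    rng.normal(6, 0.5, size=(100, 2)),   # dense blob 2
    np.array([[12.0, 12.0]]),            # an obvious outlier (index 200)
])

gm = GaussianMixture(n_components=2, random_state=0).fit(data)
log_p = gm.score_samples(data)   # absolute log-likelihood of each point

# Flag everything below a chosen likelihood threshold, here the
# 1st percentile of the training likelihoods.
threshold = np.percentile(log_p, 1)
outliers = np.where(log_p < threshold)[0]
print(outliers)
```

Note that predict_proba would give you the normalized per-cluster responsibilities instead; for outlier detection you want the absolute likelihoods, which is why score_samples is used here.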
Be warned, though: the expectation-maximization algorithm for Gaussian mixture models - which you will need to fit the parameters of your model - is not very robust and tends to find suboptimal solutions.
For reading more - and especially for understanding more - I recommend Bishop: Pattern Recognition and Machine Learning.