Solved – K-means: Does it make sense to remove the outliers after clustering the datasets

clustering, k-means, machine learning

The project requires clustering the dataset (using k-means) and then removing the outliers (using MAD) from each cluster.

However, I don't feel that it makes sense to do that. I think the outliers should be removed from the dataset first, and the clustering done afterwards.

I'm really new to k-means and machine learning in general. I would really appreciate suggestions. Thanks in advance!

EDIT1: Answering @Tim as to why outliers should be removed:

There are actually 2 processes:

1. running the k-means,
2. removing the outliers from each cluster
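
For concreteness, the second step might look like the sketch below. It assumes MAD means the median absolute deviation and applies the common modified z-score rule. The function name `remove_outliers_mad`, the 0.6745 scaling constant, and the 3.5 cut-off are assumptions for illustration, not part of the project requirements.

```python
from statistics import median

def remove_outliers_mad(values, threshold=3.5):
    """Drop values whose modified z-score exceeds the threshold (assumed rule)."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    if mad == 0:
        # All deviations are (mostly) zero; the rule is undefined, so keep everything.
        return list(values)
    return [x for x in values if 0.6745 * abs(x - med) / mad <= threshold]

# Hypothetical cluster produced by k-means, containing one extreme value.
cluster = [101, 102, 103, 104, 10000]
kept = remove_outliers_mad(cluster)
print(kept)
```

Applied per cluster, this only helps if the clusters themselves were not already distorted by the extreme values.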

Best Answer

K-means can be quite sensitive to outliers.

So if you think you need to remove them, I would rather remove them first, or use an algorithm that is more robust to noise. For example, k-medians is more robust and very similar to k-means; DBSCAN is another option.
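
A minimal sketch of why a median-based centre is more robust than a mean-based one. The values in `cluster` are hypothetical and chosen only to make the effect visible.

```python
from statistics import mean, median

# Hypothetical cluster members; 500 plays the role of the outlier.
cluster = [10, 11, 12, 13, 500]

centre_mean = mean(cluster)      # how k-means would place the centre
centre_median = median(cluster)  # how k-medians would place the centre
print(centre_mean, centre_median)
```

The mean is dragged far away from the bulk of the points by the single extreme value, while the median stays inside the group.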

Consider, for example, this one dimensional dataset: 1 2 3 4 101 102 103 104 10000.
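
One way to check what the outlier does on this dataset is to search for the optimal k-means solution directly. The sketch below assumes k = 2 (the value of k is not stated here) and relies on the fact that in one dimension optimal k-means clusters are contiguous in sorted order, so trying every split point is enough. The helper names `sse` and `best_two_clusters` are made up for this illustration.

```python
def sse(points):
    """Sum of squared deviations from the mean: the k-means cost of one cluster."""
    centre = sum(points) / len(points)
    return sum((x - centre) ** 2 for x in points)

def best_two_clusters(data):
    """Exhaustively find the k=2 split of 1-D data with the lowest total cost."""
    xs = sorted(data)
    splits = [(xs[:i], xs[i:]) for i in range(1, len(xs))]
    return min(splits, key=lambda s: sse(s[0]) + sse(s[1]))

data = [1, 2, 3, 4, 101, 102, 103, 104, 10000]
with_outlier = best_two_clusters(data)
without_outlier = best_two_clusters(data[:-1])  # same data with 10000 removed
print(with_outlier)
print(without_outlier)
```

With the outlier present, the cheapest split isolates 10000 in its own cluster and merges the two real groups; once it is removed, the split falls between 4 and 101.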

Related Solutions

Solved – Silhouette coefficients after deleting some data and re-clustering

That's a good question. The value of the Silhouette index for an object shows how strongly justified the decision is to assign the object to its actual cluster rather than to the other cluster closest to it. A value tending to 1 indicates high justification (a well-clustered object). A negative value indicates that the object would fit better in that other cluster. A value close to zero is characteristic of a "borderline" object lying between the two clusters.

In real data, even an optimal clustering will leave some objects with low positive values, because neighbouring clusters usually "touch" each other at their borders. Unless the value is negative, there is no reason to reassign the object (though you may do it, and sometimes it will improve the clusters). Nor is there reason to delete objects with low positive values. Deleting borderline objects may not help: re-clustering after the deletion will redefine the clusters and make other points borderline in place of the deleted ones, so you are not guaranteed to improve the overall cluster solution. In addition, deleting is a gross intervention in real data, and you must have a strong reason to treat intermediate points as outliers.

Also, you should take into account that the original Silhouette index, which you probably use (Kaufman, L., Rousseeuw, P. Finding groups in data: an introduction to cluster analysis. New York, 1990), is based on averaged pairwise distances, whereas K-means clustering tries to minimize deviations from the cluster centre. Thus, that index is not a very good judge for K-means. One should redefine the terms of the Silhouette index formula to be about deviations from centres; then the index is more appropriate for K-means (if you use SPSS you may find a program to compute such a modified Silhouette on my web-page).
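
The three cases can be illustrated with a small sketch of the standard per-object formula s(i) = (b - a) / max(a, b), where a is the mean distance to the other members of the object's own cluster and b is the lowest mean distance to the members of any other cluster. The function below is a simplified illustration for 1-D points with at least two clusters; the name `silhouette` and the sample data are assumptions.

```python
def silhouette(i, points, labels):
    """Classic silhouette value of object i for 1-D points."""
    def mean_dist_to(cluster):
        dists = [abs(points[i] - p) for j, p in enumerate(points)
                 if labels[j] == cluster and j != i]
        return sum(dists) / len(dists) if dists else None

    a = mean_dist_to(labels[i])   # cohesion: mean distance within own cluster
    if a is None:                 # singleton cluster: value is 0 by convention
        return 0.0
    # separation: mean distance to the nearest other cluster
    b = min(mean_dist_to(c) for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

points = [1, 2, 3, 4, 101, 102, 103, 104]
good = silhouette(3, points, [0, 0, 0, 0, 1, 1, 1, 1])  # point 4 in its natural cluster
bad = silhouette(3, points, [0, 0, 0, 1, 1, 1, 1, 1])   # point 4 assigned to the far cluster
print(good, bad)
```

The well-assigned object gets a value close to 1, and the same object assigned to the wrong cluster gets a negative value.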
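
As a rough illustration of such a redefinition, the sketch below replaces the averaged pairwise distances with distances to cluster centres: a becomes the distance from the object to its own cluster's centre, and b the distance to the nearest other centre. This is only one possible reading, not necessarily what the SPSS program mentioned above computes; the name `centroid_silhouette` and the 1-D sample data are assumptions.

```python
def centroid_silhouette(i, points, labels):
    """Silhouette-like value of object i based on distances to cluster centres."""
    centres = {}
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centres[c] = sum(members) / len(members)  # centre includes object i itself

    a = abs(points[i] - centres[labels[i]])       # deviation from own centre
    b = min(abs(points[i] - m)                    # deviation from nearest other centre
            for c, m in centres.items() if c != labels[i])
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

points = [1, 2, 3, 4, 101, 102, 103, 104]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
values = [centroid_silhouette(i, points, labels) for i in range(len(points))]
print(values)
```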

The dependency of K-means on the choice of initial cluster centres should also be remembered here, as @user603 points out in their comment.

Machine Learning – Run Time Analysis of K-Means Clustering Algorithm

Looking at these notes, the time complexity of Lloyd's algorithm for k-means clustering is given as:

O(n * K * I * d)

n : number of points
K : number of clusters
I : number of iterations
d : number of attributes

My gut feeling is that in your case the number of iterations (and the number of attributes) is assumed to be constant.
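
To see where the four factors come from, here is a minimal sketch of Lloyd's algorithm that counts the coordinate-level operations in the assignment step. It is a simplified illustration: it runs a fixed number of iterations instead of testing for convergence, and it takes the first k points as initial centres. The function name `lloyd` and the sample points are assumptions.

```python
def lloyd(points, k, iterations):
    """Plain Lloyd's algorithm; returns (centres, coordinate operations counted)."""
    d = len(points[0])
    centres = [list(p) for p in points[:k]]   # assumption: first k points as initial centres
    ops = 0
    for _ in range(iterations):               # I iterations
        groups = [[] for _ in range(k)]
        for p in points:                      # n points
            best, best_dist = 0, float("inf")
            for c in range(k):                # K centres
                dist = 0.0
                for j in range(d):            # d attributes
                    dist += (p[j] - centres[c][j]) ** 2
                    ops += 1
                if dist < best_dist:
                    best, best_dist = c, dist
            groups[best].append(p)
        # Update step costs O(n * d) per iteration, dominated by the assignment step.
        for c, members in enumerate(groups):
            if members:
                centres[c] = [sum(m[j] for m in members) / len(members)
                              for j in range(d)]
    return centres, ops

points = [(1.0, 1.0), (10.0, 10.0), (2.0, 2.0), (11.0, 11.0)]
centres, ops = lloyd(points, k=2, iterations=3)
print(centres, ops)
```

The counter ends up equal to n * K * I * d (here 4 * 2 * 3 * 2), so if I and d are treated as constants the cost grows as O(n * K).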
