Solved – K-means: Does it make sense to remove the outliers after clustering the dataset?

clustering, k-means, machine learning

The project requires me to cluster the dataset (using k-means) and then remove the outliers (using MAD) from each cluster.

However, I don't feel that it makes sense to do that. I think the outliers should be removed from the dataset first, and the clustering should be done afterwards.

I'm really new to k-means and machine learning in general. I would really appreciate suggestions. Thanks in advance!

EDIT1: Answering @Tim as to why outliers should be removed:

There are actually two processes:

  1. running the k-means,

  2. removing the outliers from each cluster
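For reference, the second step could look roughly like the sketch below. It assumes one-dimensional data, cluster labels that already came out of a k-means run, and a cutoff of 3 scaled MADs. The function names, the sample values, and the threshold are all illustrative assumptions, not part of the project requirements.

```python
from statistics import median

def mad_filter(values, threshold=3.0):
    """Keep values within `threshold` scaled MADs of the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption: with zero spread there is nothing to scale by, so keep everything.
        return list(values)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation for normal data
    return [v for v in values if abs(v - med) / scale <= threshold]

def remove_outliers_per_cluster(values, labels, threshold=3.0):
    """Group values by cluster label, then apply the MAD filter inside each cluster."""
    clusters = {}
    for value, label in zip(values, labels):
        clusters.setdefault(label, []).append(value)
    return {label: mad_filter(vs, threshold) for label, vs in clusters.items()}

# Hypothetical data; the labels stand in for the output of step 1 (k-means).
values = [1, 2, 3, 4, 50, 101, 102, 103, 104]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
cleaned = remove_outliers_per_cluster(values, labels)
print(cleaned)
```

Here the value 50 is dropped from cluster 0 because it lies far more than 3 scaled MADs from that cluster's median, while cluster 1 is left unchanged.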

Best Answer

K-means can be quite sensitive to outliers.

So if you think you need to remove them, I would rather remove them first, or use an algorithm that is more robust to noise. For example, k-medians is more robust and very similar to k-means, or you could use DBSCAN.
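A quick way to see why a median-based center is more robust is to compare the two centers on a single cluster that has picked up one extreme value. The cluster below is a made-up illustration, not output from any real run.

```python
from statistics import median

# Hypothetical cluster: four nearby points plus one extreme value.
cluster = [101, 102, 103, 104, 10000]

mean_center = sum(cluster) / len(cluster)  # what k-means uses as the cluster center
median_center = median(cluster)            # what k-medians uses as the cluster center

print(mean_center)    # dragged far away from the bulk of the points
print(median_center)  # stays among the bulk of the points
```

The mean lands at 2082, nowhere near any actual point, while the median stays at 103.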

Consider, for example, this one dimensional dataset: 1 2 3 4 101 102 103 104 10000.
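One way to explore this dataset is the sketch below. It assumes k = 2 (the value of k is my assumption) and uses a brute-force search over splits of the sorted data instead of the usual iterative k-means. This works because, in one dimension, the clusters that minimize the k-means cost are contiguous ranges. The function names are made up for illustration.

```python
def sse(points):
    """Sum of squared deviations from the mean: the k-means cost of one cluster."""
    mean = sum(points) / len(points)
    return sum((p - mean) ** 2 for p in points)

def best_two_clusters(data):
    """Exact 2-means in one dimension: try every split point of the sorted data."""
    xs = sorted(data)
    splits = [(xs[:i], xs[i:]) for i in range(1, len(xs))]
    return min(splits, key=lambda s: sse(s[0]) + sse(s[1]))

data = [1, 2, 3, 4, 101, 102, 103, 104, 10000]

# With the outlier present, it takes a cluster for itself and the two
# natural groups are merged into the other cluster.
left, right = best_two_clusters(data)
print(left, right)

# With the outlier removed first, the two natural groups are recovered.
left_clean, right_clean = best_two_clusters(data[:-1])
print(left_clean, right_clean)
```

With k = 2, clustering first and then removing outliers per cluster would only ever see the merged group, so the structure around 1 to 4 and 101 to 104 would be lost.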