The requirement of the project is to cluster the dataset (using k-means) and then remove the outliers (using MAD) from each of the clusters.
However, I don't feel that it makes sense to do that. I think the outliers should be removed from the dataset first, and the clustering done afterwards.
I'm really new to k-means and machine learning in general. I would really appreciate suggestions. Thanks in advance!
EDIT1: Answering @Tim as to why outliers should be removed:
There are actually two processes:
- running the k-means,
- removing the outliers from each cluster
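To make the second step concrete, here is a rough sketch of per-cluster MAD filtering. The cluster contents are made up for illustration (as if k-means had already produced them), and the cutoff of 3 scaled MADs and the helper name `remove_outliers_mad` are my own assumptions, not part of the project requirements.

```python
from statistics import median

def remove_outliers_mad(values, threshold=3.0):
    """Keep points within `threshold` scaled MADs of the median (assumed rule)."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    if mad == 0:
        # All deviations collapse to zero; nothing can be flagged, keep everything.
        return list(values)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation for normal data
    return [x for x in values if abs(x - med) / scale <= threshold]

# Hypothetical k-means output: cluster label -> points in that cluster.
clusters = {0: [1, 2, 3, 4, 50], 1: [101, 102, 103, 104]}

# Apply the MAD filter to each cluster independently.
cleaned = {label: remove_outliers_mad(points) for label, points in clusters.items()}
print(cleaned)  # {0: [1, 2, 3, 4], 1: [101, 102, 103, 104]}
```

Here the point 50 is dropped from cluster 0 because it lies far more than 3 scaled MADs from that cluster's median.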
Best Answer
K-means can be quite sensitive to outliers.
So if you think you need to remove them, I would rather remove them first, or use an algorithm that is more robust to noise. For example, k-medians is more robust and very similar to k-means, or you could use DBSCAN.
Consider, for example, this one-dimensional dataset: 1, 2, 3, 4, 101, 102, 103, 104, 10000.
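
A minimal sketch of what k-means with k = 2 can do on this dataset, using a hand-written Lloyd's iteration rather than any particular library. The function name `kmeans_1d` and the choice to start the centers at the minimum and maximum are assumptions made to keep the run deterministic; real implementations usually start from random centers and may land elsewhere.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, clusters

data = [1, 2, 3, 4, 101, 102, 103, 104, 10000]

# With the extreme value present, it takes a whole cluster for itself.
centers_with, clusters_with = kmeans_1d(data, [min(data), max(data)])
print(centers_with)   # [52.5, 10000.0]

# With the extreme value removed first, the two natural groups are found.
clean = data[:-1]
centers_clean, clusters_clean = kmeans_1d(clean, [min(clean), max(clean)])
print(centers_clean)  # [2.5, 102.5]
```

In the first run the two obvious groups (1 to 4 and 101 to 104) are merged around a center of 52.5, so removing outliers within each cluster afterwards cannot recover them.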