Solved – Clustering variables with outliers

clustering, data-imputation, multiple-imputation, outliers, sas

I am performing a cluster analysis in SAS, and some of the variables I am trying to cluster contain outliers. I tried transforming the data (taking logs and/or standardizing), but that didn't quite work out.

So, for example, if I settle on 9 clusters, one or two of them will contain just one value. Deleting outliers is not an option due to the nature of my work.

I want to assign a set of values to groups so that values in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. However, if I decrease the number of clusters and effectively merge the outlier groups into other groups, I'm afraid the merged groups won't be homogeneous.

Thus, to sum up my question: how can I deal with outliers in cluster analysis?

Thanks in advance!

Best Answer

Which clustering algorithm did you try?

k-means is known not to work very well with noise. Hierarchical clustering is very likely to produce single-element clusters; those are outliers, and there is nothing wrong with that. Or you might try DBSCAN, in which the "N" stands for "Noise". That algorithm is actually designed to handle some noise objects: points that do not lie in any dense region are left unassigned rather than forced into a cluster. You can look up the details on Wikipedia.
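To make the DBSCAN idea concrete, here is a minimal sketch in plain Python (not SAS, which the question uses; it is only meant to illustrate the logic). The data points and the values chosen for `eps` (neighborhood radius) and `min_samples` (minimum neighbors for a dense point) are made up for illustration, and the function name `dbscan` is just a label for this sketch, not a library call. On real data both parameters would need tuning.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            # Not dense enough to start a cluster; may become a border point later.
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not dense itself
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels


# Hypothetical data: two tight groups plus two isolated outliers.
points = [
    (1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (5.1, 5.2),
    (20.0, 0.0), (-15.0, 30.0),
]
labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1, -1]
```

The two isolated points end up with the label -1 instead of forming single-element clusters or being merged into a neighboring group, so the outliers are kept in the data but flagged separately.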