Solved – Feature / attribute selection for k-means or other clustering

clustering, feature-selection, k-means, references

It seems to me that the literature assumes one already knows which features / attributes to use to characterize the items being clustered.

If I have a database of items with many attributes, how do I know which attributes to choose to get a good clustering? Is there any guideline or literature that deals with this problem?

Best Answer

There is a whole research subdomain,

Subspace clustering (Wikipedia)

Subspace clustering is the task of detecting all clusters in all subspaces. This means that a point might be a member of multiple clusters, each existing in a different subspace. Subspaces can either be axis-parallel or affine.

that deals with the problem that for different parts of your data set, different attributes may be useful for clustering.
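To make the idea concrete, here is a tiny synthetic sketch (numpy, entirely hypothetical data) of what subspace structure looks like: each cluster is compact only in its own subset of attributes, and spread out in the others.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Hypothetical data: cluster A is tight in attributes 0 and 1,
# but uniform noise in attributes 2 and 3.
A = np.column_stack([rng.normal(0.0, 0.05, n),
                     rng.normal(0.0, 0.05, n),
                     rng.random(n),
                     rng.random(n)])

# Cluster B is the opposite: tight in attributes 2 and 3 only.
B = np.column_stack([rng.random(n),
                     rng.random(n),
                     rng.normal(0.5, 0.05, n),
                     rng.normal(0.5, 0.05, n)])

# Per-cluster, per-attribute spread: each cluster looks compact
# only when viewed in its own subspace.
print("A std per attribute:", A.std(axis=0).round(2))
print("B std per attribute:", B.std(axis=0).round(2))
```

No single global feature selection helps here: dropping attributes 2 and 3 destroys cluster B, and dropping 0 and 1 destroys cluster A. That is exactly the situation subspace clustering methods are designed for.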

If you also want to take correlations between attributes into account, have a look at

Correlation clustering (Wikipedia)

Correlation clustering also relates to a different task, where correlations among attributes of feature vectors in a high-dimensional space are assumed to exist guiding the clustering process. These correlations may be different in different clusters, thus a global decorrelation cannot reduce this to traditional (uncorrelated) clustering.

But be careful: two quite different tasks go by this name. Make sure you look at the publications that deal with correlated features, not at the graph-partitioning problem of the same name.

Why not k-means

K-means is an optimization problem: minimize the within-cluster variance. However, this objective does not adapt easily to subspace clustering. In subspace clustering, you assume that for some points, some attributes are not important. But if you are allowed to "ignore" attributes, you can decrease the variance arbitrarily simply by dropping attributes!
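The "cheating" is easy to see numerically: the within-cluster sum of squares is a sum of non-negative per-dimension contributions, so restricting to fewer attributes can only shrink it. A minimal numpy sketch (synthetic data, illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # one "cluster" of 100 points in 5 dimensions

def within_cluster_ss(points):
    # Sum of squared deviations from the centroid -- the quantity
    # k-means minimizes, summed over all points and dimensions.
    return ((points - points.mean(axis=0)) ** 2).sum()

# Dropping attributes monotonically decreases the objective,
# so "minimize variance while choosing attributes" is degenerate:
# the optimum is always to drop everything.
for d in range(5, 0, -1):
    print(d, "attributes ->", within_cluster_ss(X[:, :d]))
```

With zero attributes retained, the objective reaches its trivial minimum of zero, which is why the k-means objective alone cannot decide which subspace to use.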

Furthermore, in subspace clustering you must assume that some points do not belong to any subspace cluster at all. Ignoring points, again, allows you to "cheat" with variance.

Distances (and variance, which for k-means amounts to summed squared Euclidean distance; that is why k-means uses a "least squared Euclidean distance" assignment) are not comparable across different dimensionalities. In the physical world: which is longer, 1 meter or 1 cubic meter? The comparison is meaningless. You must not compare distances (and thus variances) across different subspaces.
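Even on the same scale of data, a distance computed in more dimensions is systematically larger. A small numpy sketch (random synthetic points) showing the average Euclidean distance between uniform points as we add attributes:

```python
import numpy as np

rng = np.random.default_rng(1)
# Random point pairs with every coordinate uniform in [0, 1].
a = rng.random((10000, 3))
b = rng.random((10000, 3))

# Mean Euclidean distance using only the first d attributes:
# it grows with d, so distances from a 1-d subspace and a 3-d
# subspace are not on a comparable scale.
means = {d: np.linalg.norm(a[:, :d] - b[:, :d], axis=1).mean()
         for d in (1, 2, 3)}
for d, m in means.items():
    print(f"{d} attributes: mean distance {m:.3f}")
```

So a "least distance" assignment that mixes subspaces of different dimensionality would systematically favor the lower-dimensional subspaces, regardless of the actual cluster structure.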

In my opinion, for all these reasons, the k-means concept of minimizing variance does not reasonably translate to high-dimensional data, or subspace clustering.