It seems to me that the literature assumes one already knows which features / attributes to choose to characterize an item for clustering.
If I have a database of items with many attributes, how do I know which attributes to choose to get a good clustering? Is there any guideline or literature that deals with this problem?
Best Answer
There is a whole research subdomain, subspace clustering, that deals with the problem that different attributes may be useful for clustering in different parts of your data set.
If you also want to take correlations into account, also have a look at correlation clustering. But be careful: two different problems go by that name. Make sure you look at publications that deal with correlated features.
Why not k-means
K-means is an optimization problem: minimize variance. However, this objective does not easily adapt to subspace clustering. In subspace clustering, you assume that for some points, some attributes are not important. But if you are allowed to "ignore" attributes, you can decrease variance arbitrarily, simply by dropping attributes!
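A minimal sketch of this degeneracy, using NumPy on synthetic data (the data set and the single-cluster objective are my own illustration, not from any particular paper): the sum-of-squared-deviations objective only loses non-negative terms when attributes are dropped, so it shrinks monotonically toward zero as you ignore more attributes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy data: 100 points, 5 attributes

def total_variance(X):
    """Sum of squared deviations from the mean (the k-means objective for k=1)."""
    return ((X - X.mean(axis=0)) ** 2).sum()

# Dropping attributes removes only non-negative terms from the sum,
# so the objective shrinks monotonically as attributes are "ignored".
for d in range(5, 0, -1):
    print(d, "attributes kept:", round(total_variance(X[:, :d]), 2))
```

An optimizer allowed to choose which attributes to keep would therefore always "win" by keeping as few as possible, which is why the plain variance objective cannot be used unchanged for subspace clustering.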
Furthermore, in subspace clustering you must assume that some points do not belong to any subspace cluster at all. Ignoring points, again, allows you to "cheat" with variance.
Distances (and variance, which is just squared Euclidean distance, which is why k-means uses a least-squared-Euclidean-distance assignment) are not comparable across different dimensionalities. In the physical world: which is longer, 1 meter or 1 cubic meter? The comparison does not make sense. You must not compare distances (and thus variances) across different subspaces.
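To see this concretely, here is a small simulation (the uniform-random setup is an assumption of mine for illustration): the mean squared Euclidean distance between random point pairs grows roughly linearly with the number of dimensions, so a distance measured in a 2-dimensional subspace and one measured in a 10-dimensional subspace live on entirely different scales.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000  # number of random point pairs per dimensionality

msd_by_dim = {}
for d in (1, 2, 10):
    a = rng.uniform(size=(n, d))
    b = rng.uniform(size=(n, d))
    # mean squared Euclidean distance between random point pairs;
    # for uniform [0,1] data this is about d/6, i.e. linear in d
    msd_by_dim[d] = ((a - b) ** 2).sum(axis=1).mean()
    print(d, "dims:", round(msd_by_dim[d], 3))
```

So a subspace cluster evaluated in 2 dimensions would look "tighter" than one evaluated in 10 dimensions even if both are equally good, which is the cheating opportunity described above.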
In my opinion, for all these reasons, the k-means concept of minimizing variance does not translate meaningfully to high-dimensional data or subspace clustering.