Solved – How to reduce the number of variables in cluster analysis

clustering, feature-selection, procrustes-analysis, r, random-forest

I've got 10 (yes, only 10) cases with 1000 variables each (e.g. measurements of the concentrations of 1000 different compounds at 10 different time points).
I can group these cases into 3 clusters in the 1000-dimensional space (complete linkage, cluster sizes 3, 3, and 4). This partitioning agrees with my expectations, but the clusters are not very well defined. I suspect that some variables carry little or no information, some are pure noise, and some others are responsible for this particular partitioning. I would like to identify the latter, that is, to reduce the number of variables (e.g. to 100-200) so that the cases are partitioned into the same 3 clusters and those clusters are significantly better defined than the original ones (e.g. by the silhouette coefficient).
This should be a subset of the original variables, not some new unobserved ones.
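To make the setup concrete, here is a minimal sketch in R with simulated data standing in for the real measurements; the matrix `X`, the seed, and the Euclidean distance are assumptions for illustration, while complete linkage and the silhouette coefficient come from the description above.

```r
library(cluster)  # for silhouette()

set.seed(1)
X <- matrix(rnorm(10 * 1000), nrow = 10)  # 10 cases x 1000 variables (simulated)
d <- dist(X)                              # Euclidean distances between cases
hc <- hclust(d, method = "complete")      # complete-linkage clustering
cl <- cutree(hc, k = 3)                   # the 3-cluster partition
mean(silhouette(cl, d)[, "sil_width"])    # average silhouette width of the partition
```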

I have the following ideas:

  • Go through the variables one by one and compare the cluster solution in each 1-dimensional space to the original solution. Keep only those variables which produce similar results (first sketch after this list). Not sure if this would work.
  • Go through all the variables in the original solution and remove the one whose deletion yields the largest increase in some goodness measure, such as the silhouette coefficient; repeat (second sketch below).
  • Attempt to identify the variables responsible for most of the variation, e.g. by doing multidimensional scaling into a few dimensions and then fitting this back into the original 1000 dimensions using a Procrustes rotation, keeping the variables that fit best (third sketch below). As I understand it, this would only work if just a few variables are responsible for the variation.
  • Delete the variables with the lowest predictor importance, e.g. from a random forest trained on the cluster labels (fourth sketch below)?
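A sketch of the first idea, reusing `X`, `d` and `cl` from above. Measuring agreement with the adjusted Rand index (here via the mclust package) and keeping the top 150 variables are my assumptions, not part of the question.

```r
library(mclust)  # for adjustedRandIndex()

ari <- sapply(seq_len(ncol(X)), function(j) {
  cl_j <- cutree(hclust(dist(X[, j]), method = "complete"), k = 3)  # 1-D clustering
  adjustedRandIndex(cl_j, cl)  # agreement with the original 3-cluster partition
})
keep_1d <- order(ari, decreasing = TRUE)[1:150]  # best-agreeing variables
```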
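A sketch of the second idea as greedy backward elimination. Here the original partition `cl` is held fixed (so the cases stay in the same 3 clusters) and only the silhouette is recomputed on each variable subset; stopping at 200 variables is an assumption. Note this costs on the order of $p^2$ silhouette evaluations.

```r
library(cluster)

# Average silhouette width of the fixed partition `cl` on a variable subset.
avg_sil <- function(cols) {
  d_sub <- dist(X[, cols, drop = FALSE])
  mean(silhouette(cl, d_sub)[, "sil_width"])
}

vars <- seq_len(ncol(X))
while (length(vars) > 200) {
  gains <- sapply(vars, function(v) avg_sil(setdiff(vars, v)))
  vars <- setdiff(vars, vars[which.max(gains)])  # drop the variable whose removal helps most
}
```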
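For the third idea, here is only an approximation: instead of a full Procrustes fit, each variable is scored by how well it is reconstructed (regression $R^2$) from a classical MDS configuration, and the best-reproduced variables are kept. The 2 MDS dimensions and the cutoff of 150 are assumptions, and the $R^2$ score is a stand-in for the Procrustes step.

```r
mds <- cmdscale(dist(X), k = 2)  # classical MDS into 2 dimensions

# R^2 of regressing each original variable on the MDS coordinates.
r2 <- sapply(seq_len(ncol(X)), function(j)
  summary(lm(X[, j] ~ mds))$r.squared)
keep_mds <- order(r2, decreasing = TRUE)[1:150]  # variables the low-D map reproduces best
```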
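And a sketch of the fourth idea: treat the cluster labels as a class response and rank variables by random-forest permutation importance. With only 10 cases this ranking will be extremely noisy; the package choice (randomForest), `ntree`, and the cutoff of 150 are assumptions.

```r
library(randomForest)

rf <- randomForest(x = X, y = factor(cl), importance = TRUE, ntree = 2000)
imp <- importance(rf, type = 1)  # mean decrease in accuracy per variable
keep_rf <- order(imp, decreasing = TRUE)[1:150]  # most "important" variables
```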

Would any of this work? Should I try anything else?

Best Answer

The problem with dimensionality reduction when the number of variables $\gg$ the number of observations is that your $k$ observations span an at most $(k-1)$-dimensional hyperplane on which they lie exactly.

So yes, with 10 observations anything beyond 9 dimensions is provably redundant.
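A quick numerical check of this claim, reusing the simulated `X` from the question (my addition, not part of the original answer): after centering, the $10 \times 1000$ data matrix has rank at most 9, so at most 9 singular values are nonzero.

```r
Xc <- scale(X, center = TRUE, scale = FALSE)  # centering costs one degree of freedom
round(svd(Xc)$d, 10)                          # only 9 of the 10 singular values are nonzero
```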

Many dimensionality reduction techniques (in particular PCA and SVD, but probably also MDS and others) will essentially try to preserve this hyperplane.

Don't you have a way to reduce the number of dimensions that uses the domain knowledge you have? For example, if you know that certain dimensions are expected to be highly correlated, remove the dimensions that are most correlated with others (pairwise comparison is probably best). But note that even correlation is not very stable to estimate from just 10 observations. You lose one degree of freedom for the mean, for example, which you can't really afford.
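A minimal sketch of that pruning, assuming no domain knowledge beyond the correlations themselves: repeatedly drop one variable from the most correlated remaining pair until a target count (200 here, an assumption) is reached. As noted above, with 10 observations these correlation estimates are themselves very noisy.

```r
prune_correlated <- function(X, n_keep = 200) {
  C <- abs(cor(X))  # pairwise correlations need only be computed once
  diag(C) <- 0
  keep <- seq_len(ncol(X))
  while (length(keep) > n_keep) {
    Ck <- C[keep, keep]
    worst <- which(Ck == max(Ck), arr.ind = TRUE)[1, ]  # most correlated remaining pair
    keep <- keep[-worst[1]]                             # drop one member of the pair
  }
  keep
}
vars_kept <- prune_correlated(X)
```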