Solved – How to measure shape of cluster

clustering, unsupervised learning

I know that this question is not well defined, but some clusters tend to be elliptical or lie in a lower-dimensional subspace, while others have nonlinear shapes (as in 2D or 3D examples).

Is there any measure of nonlinearity (or "shape") of clusters?

Note that in 2D and 3D space it is not a problem to see the shape of any cluster, but in higher-dimensional spaces it is hard to say anything about shape. In particular, are there any measures of how convex a cluster is?

This question was inspired by many other clustering questions where people talk about clusters that nobody is able to see (because they live in higher-dimensional spaces). Moreover, I know that there are some measures of nonlinearity for 2D curves.

Best Answer

I like Gaussian mixture models (GMMs).

One of their features is that, in the probit domain, they act like piecewise interpolators. One implication is that they can serve as a replacement basis, a universal approximator. This means that for non-Gaussian distributions, such as lognormal, Weibull, or crazier non-analytic ones, the GMM can approximate the distribution as long as some criteria are met.

So if you know the parameters of the AICc- or BIC-optimal GMM approximation, you can project it to fewer dimensions. You can rotate it and look at the principal axes of the components of the approximating GMM.

The result would be an informative and visually accessible way to look at the most important parts of higher-dimensional data using our 3D visual perception.
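As a rough sketch of that idea (not a definitive recipe), the snippet below picks the number of components by BIC using scikit-learn's `GaussianMixture`, then eigen-decomposes each component covariance to get its principal axes. The helper names (`fit_gmm_by_bic`, `component_axes`, `project_onto_component`), the synthetic data, and the cap of 6 components are my own assumptions for illustration; AICc is not built into scikit-learn, so only BIC is shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_gmm_by_bic(X, max_components=6, seed=0):
    """Fit GMMs with 1..max_components components and keep the lowest-BIC fit."""
    best, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              n_init=3, random_state=seed).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best


def component_axes(gmm):
    """Per component: (eigenvalues, eigenvectors), largest variance first."""
    out = []
    for cov in gmm.covariances_:
        vals, vecs = np.linalg.eigh(cov)       # eigh returns ascending order
        order = np.argsort(vals)[::-1]
        out.append((vals[order], vecs[:, order]))
    return out


def project_onto_component(X, gmm, k, n_dims=3):
    """Centre X on component k and project onto its top n_dims principal axes."""
    vals, vecs = np.linalg.eigh(gmm.covariances_[k])
    top = vecs[:, np.argsort(vals)[::-1][:n_dims]]
    return (X - gmm.means_[k]) @ top


# Synthetic 5-D data (an assumption for the demo): one elongated blob, one round blob.
rng = np.random.default_rng(0)
a = rng.normal(size=(300, 5)) * np.array([5.0, 1.0, 0.3, 0.3, 0.3])
b = rng.normal(size=(300, 5)) * 0.5 + 10.0
X = np.vstack([a, b])

gmm = fit_gmm_by_bic(X)
axes = component_axes(gmm)
for w, (vals, _) in zip(gmm.weights_, axes):
    print(f"weight={w:.2f}  axis std devs={np.sqrt(vals).round(2)}")

Z = project_onto_component(X, gmm, k=0)        # 3-D view aligned with component 0
```

The per-component standard deviations along the principal axes give a quick read on whether a component is roughly spherical or strongly elongated, and `Z` can go straight into an ordinary 3D scatter plot.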

EDIT: (sure thing, whuber)

There are several ways to look at the shape.

  • You can look at trends in the means. A lognormal is approximated by a series of Gaussians whose means get progressively closer and whose weights get smaller along the progression. The sum approximates the heavier tail. In n dimensions, a sequence of such components would make a lobe. You can track the distances between means (computed in the high-dimensional space) and the direction cosines between them as well. This converts the problem to much more accessible dimensions.
  • You can make a 3D system whose axes are the weight, the magnitude of the mean, and the magnitude of the variance/covariance. If you have a very high cluster count, this is a way to view the clusters in comparison with each other. It is a valuable way to convert 50k parts with 2k measures each into a few clouds in a 3D space. I can execute process control in that space, if I choose. I like the recursion of using Gaussian-mixture-model-based control on components of Gaussian mixture model fits to part parameters.
  • In terms of de-cluttering, you can discard components with very small weight, or very small weight per covariance, and so on.
  • You can plot the GMM cloud in terms of BIC, $R^2$, Mahalanobis distance to individual components or overall, and probability of membership per component or overall.
  • You could look at it like bubbles intersecting. The location of equal probability (zero Kullback-Leibler divergence) exists between each pair of GMM clusters. If you track that position, you can filter by probability of membership at that location. This gives you points on the classification boundaries and helps you isolate "loners". You can count the number of such boundaries above the threshold per member and get a measure of "connectedness" per component. You can also look at angles and distances between these locations.
  • You can resample the space using random numbers drawn from the Gaussian PDFs, then perform principal component analysis on the sample and look at the eigen-shapes and the eigenvalues associated with them.
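A few of the items above can be sketched with NumPy alone. The parameters below are hypothetical stand-ins for a fitted GMM (three components in 4-D, with means marching along a "lobe"), and the function names are mine. I am also assuming the Frobenius norm as the "magnitude" of a covariance and reading "direction cosines" as the cosine of the angle between consecutive mean-to-mean steps; other choices are equally defensible.

```python
import numpy as np

# Hypothetical fitted parameters for a 3-component GMM in 4 dimensions.
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0, 0.0, 0.0],
                  [3.0, 1.0, 0.0, 0.0],
                  [5.0, 3.0, 0.0, 0.0]])
covs = np.array([np.diag([1.0, 0.5, 0.1, 0.1]),
                 np.diag([0.8, 0.4, 0.1, 0.1]),
                 np.diag([0.5, 0.3, 0.1, 0.1])])


def summary_coordinates(weights, means, covs):
    """One 3-D point per component: (weight, |mean|, Frobenius norm of covariance)."""
    mean_norm = np.linalg.norm(means, axis=1)
    cov_norm = np.linalg.norm(covs, axis=(1, 2))   # Frobenius norm per matrix
    return np.column_stack([weights, mean_norm, cov_norm])


def mean_steps(means):
    """Distances between consecutive means and cosines between consecutive steps."""
    steps = np.diff(means, axis=0)
    dist = np.linalg.norm(steps, axis=1)
    unit = steps / dist[:, None]
    cosines = np.sum(unit[:-1] * unit[1:], axis=1)
    return dist, cosines


def resample_and_pca(weights, means, covs, n=5000, seed=0):
    """Draw n points from the mixture, then eigen-decompose the sample covariance."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n, weights)           # how many points per component
    draws = [rng.multivariate_normal(m, c, size=k)
             for m, c, k in zip(means, covs, counts)]
    X = np.vstack(draws)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(vals)[::-1]                 # largest eigenvalue first
    return vals[order], vecs[:, order]


coords = summary_coordinates(weights, means, covs)
dist, cosines = mean_steps(means)
eigvals, eigvecs = resample_and_pca(weights, means, covs)

print("3-D summary coordinates per component:\n", coords.round(3))
print("step lengths:", dist.round(3), " cosines between steps:", cosines.round(3))
print("eigenvalues of resampled mixture:", eigvals.round(3))
```

Cosines near 1 between successive steps suggest the components line up along a fairly straight lobe; smaller values suggest the lobe bends.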

EDIT:

What does "shape" mean? They say specificity is the soul of all good communication. What do you mean by "measure"?

Ideas about what it can mean:

  • an eyeball-norm sense or feel of the general form (extremely qualitative, visual accessibility)
  • a measure of GD&T (geometric dimensioning and tolerancing) shape: coplanarity, concentricity, etc. (extremely quantitative)
  • something numeric (eigenvalues, covariances, etc.)
  • a useful reduced dimension coordinate (like GMM parameters becoming dimensions)
  • a reduced noise system (smoothed in some way, then presented)

Most of the "several ways" are some variation on these.
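For the "something numeric" reading, here is one possible summary built from covariance eigenvalues. The explained-variance fractions show how much of a cluster's spread lies along each principal direction, and the participation ratio $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$ is a choice I am adding for illustration as a rough "effective dimension"; it is not something the list above prescribes. Being covariance-based, it only sees linear extent, so it can flag a cluster that lies near a low-dimensional subspace but says nothing about convexity.

```python
import numpy as np


def shape_summary(X):
    """Eigenvalue-based descriptors of a single cluster's point cloud X (n x d)."""
    vals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]   # descending order
    vals = np.clip(vals, 0.0, None)            # guard against tiny negative round-off
    explained = vals / vals.sum()              # fraction of variance per axis
    effective_dim = vals.sum() ** 2 / np.sum(vals ** 2)        # participation ratio
    return {"eigenvalues": vals,
            "explained": explained,
            "effective_dim": effective_dim}


# Synthetic cluster (an assumption for the demo): 10-D, but only 2 directions are wide.
rng = np.random.default_rng(0)
scales = np.array([5.0, 5.0] + [0.1] * 8)
cluster = rng.normal(size=(2000, 10)) * scales

summary = shape_summary(cluster)
print("explained variance:", summary["explained"].round(3))
print("effective dimension:", round(float(summary["effective_dim"]), 2))
```

An effective dimension well below the ambient dimension says the cluster is close to flat in most directions.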
