Solved – clustering algorithm that can take a maximum distance from any mean as a constraint

clusteringconstraintk-meansr

I am building an analytical tool that depends on being able to take a bunch of 1 dimensional numbers and group them into categories based on how close each number is to the mean of the group. However, instead of specifying the number of groups, I want to specify the maximum distance any individual member can be from the center of it's group.

My specific application is a list of times (from 00:00 to 23:59) and I want to partition the list into groups where no member is more than x minutes from the center of it's group. Other than this constraint, the objective function should be the same as the usual k means algorithm.

I am coding in R so if you know of a package I can use that would be terrific!
Off the top of my head, I can iteratively use the kmeans function and check the maximum distance and stop once it is below my threshold, but this is a pretty ugly solution. I am hoping there is a more direct way.

Best Answer

I feel that the case in 1D shouldn't be too hard, right?

In 1D, an optimal solution (in terms of number of clusters) might be easily found with starting at 0:00 (left). The most left point should be included in a cluster. Logically, the cluster center should be moved as far right as possible. Then iteratively, the same goes for the first point that is now not in the cluster. And so on, until you have all the points. So, no clustering exists that does better, in the sense that it works with fewer clusters.

This is without any penalty induced by being far away from a cluster center. If you want something better, you would at least have to induce a tradeoff, between the number of clusters and the distances until the center.

Perhaps if you want to do something with squares, this should be much like Gaussian Mixtures (https://cran.r-project.org/web/packages/sBIC/vignettes/GaussianMixtures.pdf). But then, the constraint is not hard.