Solved – Alternative to Otsu for dividing data into two groups

clustering, machine learning

I need to be able to automatically divide a dataset into two clusters. There are heuristic reasons to expect the data to have two clusters, which would be visually clear if one were to plot the data, and in the cases I have tested this has panned out. I am familiar with Otsu's method for turning a grayscale image into a black-and-white image, and it seems like one possible approach. My knowledge of it comes from image processing, though, and I expect there are more standard statistical methods that existed long before it; I just don't know about them. What alternatives are there, particularly ones that provide a number quantifying how well separated the two clusters are and that can also be used to detect cases where the clusters fail to exist?
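For concreteness, the Otsu idea carries over directly from histograms to a raw numeric vector: choose the threshold that maximises the between-class variance, and use the ratio of between-class to total variance as a separation score (near 1 for well-separated groups, near 0 when no split exists). A minimal sketch, with the function name and normalisation chosen just for illustration:

```r
## Otsu-style split of a numeric vector: scan cut points of the sorted data,
## keep the one with the largest between-class variance, and report the
## between/total variance ratio as a rough "how divided" score.
otsu_split <- function(x) {
  x <- sort(x)
  n <- length(x)
  total <- mean((x - mean(x))^2)                 # population variance of all data
  best_t <- NA
  best_b <- -Inf
  for (i in 1:(n - 1)) {
    w0 <- i / n
    w1 <- 1 - w0
    b <- w0 * w1 * (mean(x[1:i]) - mean(x[(i + 1):n]))^2   # between-class variance
    if (b > best_b) {
      best_b <- b
      best_t <- (x[i] + x[i + 1]) / 2            # cut halfway between neighbours
    }
  }
  list(threshold = best_t, separation = best_b / total)    # separation in [0, 1]
}

## Well-separated groups score near 1; a single blob scores much lower
otsu_split(c(rnorm(50, mean = 0), rnorm(50, mean = 6)))
```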

Note
After looking into the Jenks algorithm proposed in the answer, I found that the classInt package in R offers a number of such algorithms. I quote its documentation below to expand on the answer, and a short example of trying a few of the styles follows the excerpt. I have no idea how well these perform in practice; I post them simply to show the variety of possibilities, and because being in R makes them easy to try out for yourself.

The fixed style permits a "classIntervals" object to be specified with given breaks, set in the
fixedBreaks argument; the length of fixedBreaks should be n+1; this style can be used to insert
rounded break values.

The sd style chooses breaks based on pretty of the centred and scaled variables, and may have a
number of classes different from n; the returned par= includes the centre and scale values.

The equal style divides the range of the variable into n parts.

The pretty style chooses a number of breaks not necessarily equal to n using pretty, but likely
to be legible; arguments to pretty may be passed through ….

The quantile style provides quantile breaks; arguments to quantile may be passed through ….

The kmeans style uses kmeans to generate the breaks; it may be anchored using set.seed; the
pars attribute returns the kmeans object generated; if kmeans fails, a jittered input vector containing
rtimes replications of var is tried — with few unique values in var, this can prove necessary;
arguments to kmeans may be passed through ….

The hclust style uses hclust to generate the breaks using hierarchical clustering; the pars attribute
returns the hclust object generated, and can be used to find other breaks using getHclustClassIntervals;
arguments to hclust may be passed through ….

The bclust style uses bclust to generate the breaks using bagged clustering; it may be anchored
using set.seed; the pars attribute returns the bclust object generated, and can be used to find other
breaks using getBclustClassIntervals; if bclust fails, a jittered input vector containing rtimes
replications of var is tried — with few unique values in var, this can prove necessary; arguments
to bclust may be passed through ….

The fisher style uses the algorithm proposed by W. D. Fisher (1958) and discussed by Slocum et
al. (2005) as the Fisher-Jenks algorithm; added here thanks to Hisaji Ono.

The jenks style has been ported from Jenks’ Basic code, and has been checked for consistency
with ArcView, ArcGIS, and MapInfo (with some remaining differences); added here thanks to
Hisaji Ono; note that the sense of interval closure is reversed from the other styles, and in this
implementation has to be right-closed – use cutlabels=TRUE downstream for clarity.
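
As a quick illustration of the styles above, here is a sketch that runs a few of them on the same toy vector through classIntervals(); only that function and its documented style names are taken from the package, the data and the loop are just for demonstration:

```r
library(classInt)

set.seed(1)                                        # anchors the kmeans style, as noted above
x <- c(rnorm(60, mean = 0), rnorm(40, mean = 5))   # toy data with two obvious groups

for (style in c("jenks", "fisher", "kmeans", "quantile")) {
  ci <- classIntervals(x, n = 2, style = style)
  cat(style, "breaks:", round(ci$brks, 3), "\n")   # for n = 2 the middle value is the split point
}
```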

Best Answer

Have a look at natural breaks optimization.

https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization

The term "clustering" is mostly used for multidimensional data.

Stay away from k-means. It is popular, but usually not appropriate for 1D data. 1D data can be handled much more cleverly, since it is obviously ordered. The usual k-means algorithms (Lloyd, MacQueen) do not use this ordering and will test various nonsensical (non-contiguous) combinations.
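
To illustrate what using the ordering buys you: for two groups, the split that minimises the within-group sum of squares (the criterion the Fisher-Jenks approach optimises) is always a single cut point of the sorted data, so it can be found exactly by scanning the n-1 candidate cuts instead of iterating from random starts. A sketch with names of my own choosing (it is the same scan as the Otsu sketch in the question, just phrased as minimising within-group scatter):

```r
## Exact two-group split of a 1D vector by minimising within-group sum of squares.
## Sorting first guarantees both groups are contiguous intervals on the line.
best_two_group_split <- function(x) {
  x <- sort(x)
  n <- length(x)
  wss <- function(v) sum((v - mean(v))^2)                  # within-group sum of squares
  cost <- sapply(1:(n - 1), function(i) wss(x[1:i]) + wss(x[(i + 1):n]))
  i <- which.min(cost)
  list(cut = (x[i] + x[i + 1]) / 2, within_ss = cost[i])   # cut halfway between neighbours
}
```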

But here is another quite easy method: compute a kernel density estimate, locate its local minima, and use those to split your data.
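
A sketch of that approach using base R's density(): cut at the deepest interior local minimum of the estimate, and report failure when the estimate has no such minimum, which also gives a crude answer to the "clusters fail to exist" part of the question.

```r
## Kernel density estimate with base R's density(); split at the deepest
## interior local minimum, or return NA if the estimate has no such minimum.
kde_split <- function(x) {
  d <- density(x)                             # default bandwidth; worth varying in practice
  y <- d$y
  mins <- which(diff(sign(diff(y))) > 0) + 1  # indices lower than both neighbours
  if (length(mins) == 0) return(NA)           # unimodal estimate: no split found
  d$x[mins[which.min(y[mins])]]               # x-position of the deepest minimum
}

kde_split(c(rnorm(50, mean = 0), rnorm(50, mean = 4)))   # should land somewhere near 2
```

The depth of that minimum relative to the neighbouring modes can also serve as a rough score of how divided the two groups are.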
