Is clustering (k-means) appropriate for partitioning a one-dimensional array?

clustering, k-means

I want to group the outcome of a function into 2 (or 3) categories.

I have a function efficiency=f(weight,speed,#refueling_stops) that takes 3 input parameters and the output tells me how "efficient" a truck is.
My goal is to take the most inefficient trucks off the road. However, I don't know which trucks to keep and which to reject. In other words, I want to divide all possible output values of my function into the category "keep" or the category "reject" (or the category "between"). Furthermore, I have no way to rate how good my decision was, so the point where I draw the line(s) is more or less arbitrary. Nevertheless, I'm looking for a science-based approach to this problem.

Is there a name for this kind of problem?

So far I've stumbled upon clustering (kmeans and natural breaks / Jenks) which is completely new to me. Also I've read that my problem may be similar to converting a color image into black and white (and gray). But I couldn't find out what the current practice for this process is.

Up to now, I've calculated all possible outcomes of my function. The histogram and PDF of the resulting one-dimensional array look like this:
[Figure: histogram and PDF of the function's outcomes]

Then I partitioned them into 2 (or 3) categories via R:

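library(classInt)
# Read the outcomes as a numeric vector (assumes the file holds a single column)
x <- read.table("all_possible_outcomes")[[1]]
# Partition into 2 and then 3 classes using k-means breaks
classIntervals(x, n = 2, style = "kmeans")
classIntervals(x, n = 3, style = "kmeans")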
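For readers who want to see what such a split amounts to, here is a minimal sketch in base R. It is not the asker's data: `x` is a synthetic stand-in, and the labels "reject" and "keep" are assigned on the assumption that lower values mean lower efficiency. In one dimension, a converged k-means solution assigns each point to its nearest centre, so the boundary between two adjacent clusters can be taken as the midpoint of their centres.

```r
set.seed(1)

# Synthetic stand-in for the efficiency values (assumption: NOT the asker's data)
x <- c(rnorm(300, mean = 0.3, sd = 0.05),
       rnorm(200, mean = 0.7, sd = 0.08))

# Two-cluster k-means; several random starts guard against poor local optima
fit <- kmeans(x, centers = 2, nstart = 20)

# Sort the centres so the lower cluster comes first
centers <- sort(as.numeric(fit$centers))

# In 1D the cut between adjacent clusters is the midpoint of their centres
cut_point <- mean(centers)

# Assumed labelling: values below the cut are "reject", above are "keep"
groups <- cut(x, breaks = c(-Inf, cut_point, Inf),
              labels = c("reject", "keep"))

cut_point
table(groups)
```

The cut point is fully determined by the data and the number of clusters, which is exactly why it says nothing by itself about whether the resulting line is a sensible place to reject trucks.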

Now I'm curious if this approach to my problem is a current method, or if not, what is the best practice for this?
I guess what I'm looking for is some kind of confirmation that it's appropriate to use clustering. If not, what alternatives can you think of?

Best Answer

Clustering in one dimension has some special properties that on occasion have been exploited in customised methods. Often it seems neglected in textbook literature, which concentrates on more general problems. See (for example) the answer (not really the question!) to

How can I group numerical data into naturally forming "brackets"? (e.g. income)

That said, I am sceptical about your inclination to think that you have a clustering problem.

  1. Clustering will often be disappointing when the main characteristic of variation is that it is continuous; it is then being asked to find groups where none are well defined. In your case, given your graph I would worry greatly about the reproducibility of clusters. The estimated pdf in particular will vary greatly with kernel choices; delegating choice to e.g. automated cross-validation solves that problem only if you believe everything that goes into it.

  2. It seems that you want to make, or to guide, a decision, so perhaps that should be more central to your problem formulation.
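The reproducibility concern in point 1 can be probed with a rough sketch like the one below. It assumes a hypothetical continuous, unimodal sample (drawn from a gamma distribution purely for illustration), and the helper names `count_modes` and `kmeans_cut` are made up for this example. It counts the local maxima of the kernel density estimate under several bandwidth multipliers, and it bootstraps the two-cluster k-means cut point to see how much it moves between resamples.

```r
set.seed(42)

# Hypothetical continuous, unimodal data (assumption: no real groups exist here)
x <- rgamma(1000, shape = 4, rate = 8)

# Hypothetical helper: count strict local maxima of a kernel density estimate
# for a given bandwidth multiplier (the `adjust` argument of density())
count_modes <- function(x, adjust) {
  d <- density(x, adjust = adjust)
  sum(diff(sign(diff(d$y))) == -2)
}

# Number of apparent modes under narrow to wide bandwidths
adjust_values <- c(0.2, 0.5, 1, 2)
modes_by_bw <- sapply(adjust_values, function(a) count_modes(x, a))
names(modes_by_bw) <- adjust_values

# Hypothetical helper: 2-means cut point, i.e. the midpoint of the two centres
kmeans_cut <- function(x) {
  mean(kmeans(x, centers = 2, nstart = 10)$centers)
}

# Bootstrap: recompute the cut point on resampled data
boot_cuts <- replicate(200, kmeans_cut(sample(x, replace = TRUE)))

modes_by_bw
quantile(boot_cuts, c(0.025, 0.5, 0.975))
```

If the mode count changes with the bandwidth, the bumps in the estimated pdf are not strong evidence of groups. Note also that k-means returns a cut point even for data with no group structure at all, so a stable cut is not by itself evidence that the clusters are real.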
