Solved – statistically good way to split a set of data into (generally) uneven groups

density function, distributions, partitioning, rating

So here's the scenario:
I have about 100 or so "items", each assigned a number and a total count of guesses. The number reflects what people guess the item is worth (in terms of money); it can be either the average of the guesses or the sum of all the guesses.

I want to "split" them into n groups of (generally) uneven size. For example, some statistical formula/equation/etc. might determine that items 1, 3, 10, etc. are in group 1, items 4, 5, etc. are in group 2, and so on (up to group n).

Is there any statistical formula/equation/method that can achieve this?

I am not sure, but can Discrete Fourier Transforms do this? I've never learned about them, but I've heard that they're used to transform data into sinusoids…

Example:

You have 1000 balls, each a unique color.

You ask random people to look at some of the balls and rate each one from 1 (bad) to 1000 (good) according to how much they like it.

At the end of this survey, each ball has its own average rating (not necessarily unique).

Now, you want to split up the balls in such a way that the balls are partitioned into, say, 5 groups (1 = bad, …, 5 = good).

But here's the thing: you can't just split them into 5 equal groups, because there may be big gaps between the ratings of neighbouring balls (e.g. ball 851 has an average vote of 700, but ball 852 has an average vote of 800). The votes are basically denser in some areas than others.

I think this is the basic idea… I'm not sure what else to add. The partition of the data must be consistent and reproducible.

Best Answer

One solution is cluster analysis, which is a set of techniques for grouping things, just as you describe. There are many variations of cluster analysis, but most fall into two main families: k-means clustering and hierarchical clustering. There is a large literature on both, but, essentially, in k-means you tell the program how many clusters you want, and it assigns each ball to one of them. In hierarchical (agglomerative) clustering, balls are gradually combined into clusters, and then clusters with other clusters, starting with 1000 single-ball clusters and ending with 1 cluster that contains everything. There are various statistics that can be used to judge the best number of clusters.
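As a rough illustration of the k-means idea applied to one rating per item, here is a minimal sketch in plain Python. It is not taken from any particular library: the function name `kmeans_1d`, the `max_iter` parameter, and the choice to start the centers at evenly spaced quantiles of the distinct values are all assumptions made for this example. The fixed starting centers are there only so that the partition comes out the same on every run, which the question asks for.

```python
from statistics import mean


def kmeans_1d(values, k, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on one number per item.

    Returns (labels, centers). Labels are ordered by value:
    label 0 is the lowest-rated group, label k - 1 the highest.
    """
    distinct = sorted(set(values))
    n = len(distinct)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of distinct values")

    # Assumption: deterministic start at evenly spaced quantiles of the
    # distinct values, so repeated runs give the same partition.
    centers = [distinct[(2 * i + 1) * n // (2 * k)] for i in range(k)]

    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]

        # Update step: each center moves to the mean of its members
        # (an empty group keeps its previous center).
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(mean(members) if members else centers[j])

        if new_centers == centers:  # nothing moved, so the groups are stable
            break
        centers = new_centers

    return labels, centers


# Hypothetical average ratings for ten items, dense in some ranges only.
ratings = [700, 705, 710, 800, 805, 20, 25, 30, 35, 40]
labels, centers = kmeans_1d(ratings, 3)
print(labels)   # [1, 1, 1, 2, 2, 0, 0, 0, 0, 0]
print(centers)  # [30, 705, 802.5]
```

The three groups here have 5, 3, and 2 members, so the group sizes follow where the ratings are dense rather than being forced equal. A different starting rule can lead k-means to a different partition, so the starting rule is part of what makes the result reproducible.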
