Solved – Dividing data into two groups as similar as possible

clustering, combinatorics, optimization, partitioning

In this simplified example, I have five variables for each individual. The data is normalized, with mean 0 and sd 1 in each variable.

I would like to divide this data into two groups of equal size, and I'd like the mean of each variable within each group to be as close to zero as possible (since the data is normalized, this would mean the groups are as similar as possible).

I don't know any theory about this kind of problem. My only solution so far would be to randomly divide the individuals millions of times and select the partition with the smallest distance between the means of the two groups.

Could you point me to the kind of theory I should read in order to approach this problem more efficiently?

One example of data like this in R would be:

set.seed(123)
A <- matrix(runif(100), 20, 5)
A <- scale(A) #normalize
rownames(A) <- letters[1:20]
colnames(A) <- paste0("x",1:5)
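The random-search idea from the question can be sketched as follows. This is a minimal baseline in R using the example data above; the function name `random_split` and the iteration count are my own choices, not from the original post.

```r
set.seed(123)
A <- matrix(runif(100), 20, 5)
A <- scale(A)  # normalize: mean 0, sd 1 per column

# Try many random equal-size splits and keep the one whose
# group mean vectors are closest to the origin.
random_split <- function(A, n_iter = 10000) {
  n <- nrow(A)
  best <- NULL
  best_score <- Inf
  for (i in seq_len(n_iter)) {
    g1 <- sample(n, n / 2)  # indices of group 1; the rest form group 2
    # squared distance of each group's mean vector from 0
    score <- sum(colMeans(A[g1, , drop = FALSE])^2) +
             sum(colMeans(A[-g1, , drop = FALSE])^2)
    if (score < best_score) {
      best_score <- score
      best <- g1
    }
  }
  list(group1 = best, score = best_score)
}

res <- random_split(A, n_iter = 5000)
```

This works, but the expected quality improves only slowly with the number of iterations, which is what motivates the structured heuristic in the answer below.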

Best Answer

Please see this answer to a very similar question:

https://stackoverflow.com/a/39338999/1060350

There, the requirement was to form pairs of nearby points and put them into separate groups (avoid the term "cluster" when these aren't regular clusters). The same idea applies even more directly to your requirement.

A similar, and perhaps surprising, heuristic for your aim of a mean close to 0 is this:

  1. Sort points by their distance from the center (here 0).
  2. Take the farthest remaining point $x$, and find the point $y$ which minimizes $||x+y||_2$.
  3. Assign both to cluster A or B, chosen at random.
  4. Repeat step 2 to form the next pair, but assign it to the other cluster.
  5. If more than 3 points remain, go back to step 2.
  6. Assign the remaining points according to your size constraints, using them to slightly improve the means where possible.

This is only a heuristic - it will not find the optimum solution - but the problem is probably NP-hard, so you likely can't afford an exhaustive search for the optimum.

The greedy approach above should find a decent solution, because it identifies the hardest points first and then chooses the best counterbalance for each.

If you need better results, you can modify step 3 to not choose at random, but instead assign each pair to the cluster whose center would then deviate least from 0. The same idea can be used in step 2: instead of minimizing $||x+y||_2$, minimize $||C+x+y||_2$, where $C$ is the sum of the objects already in the cluster. These tweaks make the code slightly more complicated, but add little computational overhead.
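The greedy refinement of step 3 can be sketched like this, again as my own illustrative R code with a hypothetical function name. It tracks each cluster's running sum and assigns each pair to whichever cluster that moves closest to 0, subject to the size cap.

```r
set.seed(123)
A <- matrix(runif(100), 20, 5)
A <- scale(A)  # normalize: mean 0, sd 1 per column

pair_heuristic_greedy <- function(A) {
  n <- nrow(A)
  half <- n / 2                           # size cap per cluster
  remaining <- seq_len(n)
  group <- integer(n)
  S <- list(rep(0, ncol(A)), rep(0, ncol(A)))  # running sums per cluster
  size <- c(0L, 0L)
  while (length(remaining) > 1) {
    d <- rowSums(A[remaining, , drop = FALSE]^2)
    x <- remaining[which.max(d)]          # farthest point from 0
    rest <- setdiff(remaining, x)
    s <- sweep(A[rest, , drop = FALSE], 2, A[x, ], "+")
    y <- rest[which.min(rowSums(s^2))]    # best counterbalance
    pair_sum <- A[x, ] + A[y, ]
    # greedy step 3: pick the cluster whose sum the pair moves closest to 0
    cand <- sapply(1:2, function(k) sum((S[[k]] + pair_sum)^2))
    cand[size + 2 > half] <- Inf          # respect the size constraint
    k <- which.min(cand)
    group[c(x, y)] <- k
    S[[k]] <- S[[k]] + pair_sum
    size[k] <- size[k] + 2L
    remaining <- setdiff(remaining, c(x, y))
  }
  group
}

g <- pair_heuristic_greedy(A)
```

For even `n` this fills both clusters to exactly `n/2` while keeping both running sums small.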
