Solved – Grouping Combinatorial Optimization

algorithmsoptimization

I have a real world problem for which I need to create an optimization algorithm.

I have a set A, and a group of sets, let's say 500 sets. I need to find the best combination of them to better match A, by combining in 2 groups. The order is not important. Example:

Sets:

B,C,D,E,F,G,H,J,etc

Possible Solutions:

  • {B,C,K},{D,E}
  • {D,E},{B,C,K}
  • {K,D},{B,C,J}
  • {J},{K,B,C,D}

The objective function creates a set with the union of the sets in the first group, and then subtracts from this set its intersection with the second group. Then it calculates how much this new set matches the set A. I need the combination of sets that better matches A.

I would be grateful if someone could advise me with similar problems, keywords, books, articles or ideas. I have been reading about optimization for the past week, but I can't seem to find a clue where to start.

The problem also seems related to probability, as each set can be seen as $P(B)$, $P(C)$, $P(B\cup C)$, etc.

Edit:

I made a more elaborated question before that had no answers or comments so I opted for a more brief explanation. I'm trying to match the set $A = \{2,4,6\}$. So by combining other sets by union and subtraction of the intersection, it may increase the number of matching values and decrease the number of non-matching values.

\begin{align*}
B &= \{2,5,8\}\\
C &= \{4\}\\
D &= \{5,8\}\\
B\cup C &= \{2,4,5,8\}
\end{align*}

Now subtracting the intersection of D with this new set:

$$
B\cup C – (B\cup C \cap D) = \{2,4,5,8\} – \{5,8\} = \{2,4\}
$$

After this calculation, the new set now matches 66% of the elements in A. The purpose is to find the combination that has more elements matching A and less elements non-matching the elements in A.

Best Answer

Essentially you are asking how to select two groups of sets such that the elements in the first group but not in the second are as similar to some target set $A$ as possible. There are two competing objectives here:

  • True positive rate (TPR): Proportion of all elements in set A that you obtain.
  • False positive rate (FPR): Proportion of all elements not in set A that you obtain.

Just as in classification, it is easy to simultaneously maximize TPR and FPR (select all the sets in the first group) or to obtain TPR and FPR of 0 (select none of the sets in the first group), and you're interested in an intermediate solution that simultaneously obtains a high TPR and low FPR.

I would proceed using mathematical optimization, in particular mixed integer linear optimization. I will use the following notation for the input data:

  • $A$: the target set
  • $m_{ij}$: Whether item $i$ is included in set $j$ (1 or 0)

I will then define the following decision variables, all of which are binary (take values 1 or 0):

  • $X_j$: is set $j$ selected in the first group?
  • $Y_j$: is set $j$ selected in the second group?
  • $x_i$: is item $i$ in the first group?
  • $y_i$: is item $i$ in the second group?
  • $z_i$: is item $i$ selected?

Based on these decision variables, it is easy to obtain formulas for the TPR and FPR that are linear in the decision variables:

\begin{align*} TPR &= \frac{1}{|A|}\sum_{i\in A} z_i \\ FPR &= \frac{1}{|\bar A|}\sum_{i\notin A} z_i \end{align*}

We can make the objective function be a convex combination of the TPR and the negative FPR:

$$ \max \alpha~TPR - (1-\alpha)~FPR $$

By re-solving the optimization problem with $\alpha$ ranging from 0 to 1, we can obtain an efficient frontier trading off TPR and FPR.

We now need a series of constraints to connect the $X_j$ and $Y_j$ decision variables (which sets are in groups 1 and 2) to the other decision variables.

A set is selected in at most one of the two groups:

$$ X_j + Y_j\leq 1~\forall j $$

Connect the $x_i$ variables with the $X_j$ variables:

\begin{align*} x_i &\geq m_{ij}X_j &\forall i,j \\ x_i &\leq \sum_j m_{ij}X_j &\forall i \end{align*}

Similarly connect the $y_i$ variables with the $Y_j$ variables:

\begin{align*} y_i &\geq m_{ij}Y_j &\forall i,j \\ y_i &\leq \sum_j m_{ij}Y_j &\forall i \end{align*}

Connect $z_i$ to $x_i$ and $y_i$ (we want $z_i = x_i(1-y_i)$, but we do this a bit differently to avoid bilinear terms):

\begin{align*} z_i &\geq x_i - y_i &\forall i \\ z_i &\leq x_i &\forall i \\ z_i &\leq 1-y_i &\forall i \end{align*}

There are a few different constraints here, so it's a bit of work to implement this; I've done it in R using the lpSolve package:

library(lpSolve)
opt.with.alpha <- function(A, sets, alpha) {
  I <- sort(unique(c(A, unlist(sets))))
  nI <- length(I)
  J <- seq_along(sets)
  nJ <- length(J)
  nIJ <- length(unlist(sets))
  vars <- rbind(data.frame(type="X", i=NA, j=J),
                data.frame(type="Y", i=NA, j=J),
                data.frame(type="x", i=I, j=NA),
                data.frame(type="y", i=I, j=NA),
                data.frame(type="z", i=I, j=NA))
  xIJ <- matrix(0, nrow=nIJ, ncol=nrow(vars))
  xIJ[cbind(seq_len(nIJ), which(vars$type == "X")[rep(J, sapply(sets, length))])] <- -1
  xIJ[cbind(seq_len(nIJ), which(vars$type == "x")[match(unlist(sets), I)])] <- 1
  xI <- matrix(0, nrow=nI, ncol=nrow(vars))
  xI[cbind(seq_len(nI), which(vars$type == "x"))] <- 1
  xI[cbind(match(unlist(sets), I), which(vars$type == "X")[rep(J, sapply(sets, length))])] <- -1
  yIJ <- matrix(0, nrow=nIJ, ncol=nrow(vars))
  yIJ[cbind(seq_len(nIJ), which(vars$type == "Y")[rep(J, sapply(sets, length))])] <- -1
  yIJ[cbind(seq_len(nIJ), which(vars$type == "y")[match(unlist(sets), I)])] <- 1
  yI <- matrix(0, nrow=nI, ncol=nrow(vars))
  yI[cbind(seq_len(nI), which(vars$type == "y"))] <- 1
  yI[cbind(match(unlist(sets), I), which(vars$type == "Y")[rep(J, sapply(sets, length))])] <- -1
  con <- rbind(t(sapply(J, function(j) vars$type %in% c("X", "Y") & vars$j == j)),
               xIJ, xI, yIJ, yI,
               t(sapply(I, function(i) (vars$type == "z" & vars$i == i) -
                                       (vars$type == "x" & vars$i == i) +
                                       (vars$type == "y" & vars$i == i))),
               t(sapply(I, function(i) (vars$type == "z" & vars$i == i) -
                                       (vars$type == "x" & vars$i == i))),
               t(sapply(I, function(i) (vars$type == "z" & vars$i == i) +
                                       (vars$type == "y" & vars$i == i)))) * 1
  mod <- lp(direction="max",
            objective.in = ifelse(vars$type == "z" & vars$i %in% A, alpha/length(A),
                                  ifelse(vars$type == "z", -(1-alpha)/(nI-length(A)), 0)),
            const.mat=con,
            const.dir=rep(c("<=", ">=", "<=", ">=", "<=", ">=", "<="),
                          c(nJ,   nIJ,  nI,   nIJ,  nI,   nI,   2*nI)),
            const.rhs=rep(c(1, 0, 1), c(nJ, 2*nIJ+4*nI, nI)),
            all.bin = TRUE)
  data.frame(group1 = paste(which(mod$solution[vars$type == "X"] > 0.999), collapse=", "),
             group2 = paste(which(mod$solution[vars$type == "Y"] > 0.999), collapse=", "),
             selectedItems = paste(which(mod$solution[vars$type == "z"] > 0.999), collapse=", "),
             TPR = mean(A %in% which(mod$solution[vars$type == "z"] > 0.999)),
             FPR = mean(setdiff(I, A) %in% which(mod$solution[vars$type == "z"] > 0.999)))
}
opt.sets <- function(A, sets) {
  unique(do.call(rbind, lapply(seq(0, 1, .01), function(alpha) opt.with.alpha(A, sets, alpha=alpha))))
}

Let's now consider a slightly more interesting example than the one presented in the problem:

  • $A = \{1, 2, 3, 5, 7\}$
  • $B = \{1, 6\}$
  • $C = \{1, 5, 6\}$
  • $D = \{1, 2, 4, 5\}$
  • $E = \{1, 3, 4, 6, 8\}$

The code then provides an efficient frontier trading off the TPR and FPR:

A <- c(1, 2, 3, 5, 7)
B <- c(1, 6)
C <- c(1, 5, 6)
D <- c(1, 2, 4, 5)
E <- c(1, 3, 4, 6, 8)
opt.sets(A, list(B, C, D, E))
#   group1 group2       selectedItems TPR       FPR
# 1                                   0.0 0.0000000
# 2   1, 3      4                2, 5 0.4 0.0000000
# 4   2, 3      4                2, 5 0.4 0.0000000
# 8      3                 1, 2, 4, 5 0.6 0.3333333
# 9   3, 4        1, 2, 3, 4, 5, 6, 8 0.8 1.0000000

From the output, we see that there are three different achievable efficient solutions:

  • TPR=0.4, FPR=0: We can achieve this with $B\cup D\setminus E$ or $C\cup D\setminus E$, in both cases obtaining items $\{2, 5\}$.
  • TPR=0.6, FPR=0.333: We can achieve this with $D = \{1, 2, 4, 5\}$.
  • TPR=0.8, FPR=1: We can achieve this, for instance, with $C\cup D = \{1, 2, 3, 4, 5, 6, 8\}$.
Related Question