Essentially, you are asking how to select two groups of sets so that the elements covered by the first group but not by the second are as similar as possible to some target set $A$. There are two competing objectives here:
- True positive rate (TPR): Proportion of all elements in set A that you obtain.
- False positive rate (FPR): Proportion of all elements not in set A that you obtain.
Just as in classification, it is easy to simultaneously maximize both rates (select every set into the first group and none into the second) or to drive both to 0 (select no sets into the first group); you're interested in intermediate solutions that achieve a high TPR together with a low FPR.
I would proceed using mathematical optimization, in particular mixed integer linear optimization. I will use the following notation for the input data:
- $A$: the target set
- $m_{ij}$: Whether item $i$ is included in set $j$ (1 or 0)
I will then define the following decision variables, all of which are binary (take values 1 or 0):
- $X_j$: is set $j$ selected in the first group?
- $Y_j$: is set $j$ selected in the second group?
- $x_i$: is item $i$ in the first group?
- $y_i$: is item $i$ in the second group?
- $z_i$: is item $i$ selected?
Based on these decision variables, it is easy to obtain formulas for the TPR and FPR that are linear in the decision variables:
\begin{align*}
TPR &= \frac{1}{|A|}\sum_{i\in A} z_i \\
FPR &= \frac{1}{|\bar A|}\sum_{i\notin A} z_i
\end{align*}
We can make the objective function be a convex combination of the TPR and the negative FPR:
$$
\max \alpha~TPR - (1-\alpha)~FPR
$$
By re-solving the optimization problem with $\alpha$ ranging from 0 to 1, we can trace out an efficient frontier trading off TPR against FPR. (As with any weighted-sum scalarization, this recovers only the supported points of the frontier, i.e., those on its convex hull.)
We now need a series of constraints to connect the $X_j$ and $Y_j$ decision variables (which sets are in groups 1 and 2) to the other decision variables.
A set is selected in at most one of the two groups:
$$
X_j + Y_j\leq 1~\forall j
$$
Connect the $x_i$ variables with the $X_j$ variables (the first constraint forces $x_i = 1$ whenever some selected set contains item $i$; the second forces $x_i = 0$ when no selected set does):
\begin{align*}
x_i &\geq m_{ij}X_j &\forall i,j \\
x_i &\leq \sum_j m_{ij}X_j &\forall i
\end{align*}
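These two inequalities are the standard linearization of a logical OR: together they pin the binary variable $x_i$ to $\max_j m_{ij}X_j$. A quick illustration in Python (with a toy membership matrix made up for this sketch; the actual implementation below is in R):

```python
def linked_x(m, X):
    # The value the constraints force: item i is covered iff some selected set contains it.
    return [int(any(mij and Xj for mij, Xj in zip(row, X))) for row in m]

# Toy data: 3 items, 2 sets; m[i][j] = 1 if item i belongs to set j.
m = [[1, 0],
     [0, 0],
     [1, 1]]
X = [0, 1]          # only the second set goes into group 1
x = linked_x(m, X)  # -> [0, 0, 1]

# Both constraint families hold at this x:
for i, row in enumerate(m):
    assert all(x[i] >= mij * Xj for mij, Xj in zip(row, X))  # x_i >= m_ij X_j for all j
    assert x[i] <= sum(mij * Xj for mij, Xj in zip(row, X))  # x_i <= sum_j m_ij X_j
```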
Similarly connect the $y_i$ variables with the $Y_j$ variables:
\begin{align*}
y_i &\geq m_{ij}Y_j &\forall i,j \\
y_i &\leq \sum_j m_{ij}Y_j &\forall i
\end{align*}
Connect $z_i$ to $x_i$ and $y_i$ (we want $z_i = x_i(1-y_i)$, but we do this a bit differently to avoid bilinear terms):
\begin{align*}
z_i &\geq x_i - y_i &\forall i \\
z_i &\leq x_i &\forall i \\
z_i &\leq 1-y_i &\forall i
\end{align*}
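Since all variables are binary, enumerating the four cases of $(x_i, y_i)$ confirms that these three inequalities leave exactly one feasible value for $z_i$, namely $x_i(1-y_i)$:

$$
\begin{array}{cc|ccc|c}
x_i & y_i & z_i \geq x_i - y_i & z_i \leq x_i & z_i \leq 1-y_i & \Rightarrow z_i \\
\hline
0 & 0 & z_i \geq 0 & z_i \leq 0 & z_i \leq 1 & 0 \\
0 & 1 & z_i \geq -1 & z_i \leq 0 & z_i \leq 0 & 0 \\
1 & 0 & z_i \geq 1 & z_i \leq 1 & z_i \leq 1 & 1 \\
1 & 1 & z_i \geq 0 & z_i \leq 1 & z_i \leq 0 & 0
\end{array}
$$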
There are a few different constraints here, so it's a bit of work to implement this; I've done it in R using the lpSolve package:
library(lpSolve)

opt.with.alpha <- function(A, sets, alpha) {
  # Index sets: I = all items appearing anywhere, J = the candidate sets
  I <- sort(unique(c(A, unlist(sets))))
  nI <- length(I)
  J <- seq_along(sets)
  nJ <- length(J)
  nIJ <- length(unlist(sets))  # total number of (item, set) membership pairs

  # One row per decision variable: X_j, Y_j, x_i, y_i, z_i
  vars <- rbind(data.frame(type="X", i=NA, j=J),
                data.frame(type="Y", i=NA, j=J),
                data.frame(type="x", i=I, j=NA),
                data.frame(type="y", i=I, j=NA),
                data.frame(type="z", i=I, j=NA))

  # x_i >= m_ij X_j (one constraint row per membership pair)
  xIJ <- matrix(0, nrow=nIJ, ncol=nrow(vars))
  xIJ[cbind(seq_len(nIJ), which(vars$type == "X")[rep(J, sapply(sets, length))])] <- -1
  xIJ[cbind(seq_len(nIJ), which(vars$type == "x")[match(unlist(sets), I)])] <- 1

  # x_i <= sum_j m_ij X_j (one constraint row per item)
  xI <- matrix(0, nrow=nI, ncol=nrow(vars))
  xI[cbind(seq_len(nI), which(vars$type == "x"))] <- 1
  xI[cbind(match(unlist(sets), I), which(vars$type == "X")[rep(J, sapply(sets, length))])] <- -1

  # The same two constraint families linking y_i with Y_j
  yIJ <- matrix(0, nrow=nIJ, ncol=nrow(vars))
  yIJ[cbind(seq_len(nIJ), which(vars$type == "Y")[rep(J, sapply(sets, length))])] <- -1
  yIJ[cbind(seq_len(nIJ), which(vars$type == "y")[match(unlist(sets), I)])] <- 1
  yI <- matrix(0, nrow=nI, ncol=nrow(vars))
  yI[cbind(seq_len(nI), which(vars$type == "y"))] <- 1
  yI[cbind(match(unlist(sets), I), which(vars$type == "Y")[rep(J, sapply(sets, length))])] <- -1

  # Full constraint matrix: X_j + Y_j <= 1, the four linking families above,
  # then z_i >= x_i - y_i, z_i <= x_i, z_i <= 1 - y_i
  con <- rbind(t(sapply(J, function(j) vars$type %in% c("X", "Y") & vars$j == j)),
               xIJ, xI, yIJ, yI,
               t(sapply(I, function(i) (vars$type == "z" & vars$i == i) -
                                       (vars$type == "x" & vars$i == i) +
                                       (vars$type == "y" & vars$i == i))),
               t(sapply(I, function(i) (vars$type == "z" & vars$i == i) -
                                       (vars$type == "x" & vars$i == i))),
               t(sapply(I, function(i) (vars$type == "z" & vars$i == i) +
                                       (vars$type == "y" & vars$i == i)))) * 1

  # Objective: alpha * TPR - (1 - alpha) * FPR, expressed through the z_i coefficients
  mod <- lp(direction="max",
            objective.in = ifelse(vars$type == "z" & vars$i %in% A, alpha/length(A),
                           ifelse(vars$type == "z", -(1-alpha)/(nI-length(A)), 0)),
            const.mat=con,
            const.dir=rep(c("<=", ">=", "<=", ">=", "<=", ">=", "<="),
                          c(nJ, nIJ, nI, nIJ, nI, nI, 2*nI)),
            const.rhs=rep(c(1, 0, 1), c(nJ, 2*nIJ+4*nI, nI)),
            all.bin = TRUE)

  # Map the z_i solution back to item labels via I (positions in the solution
  # vector need not coincide with item labels in general)
  sel <- I[mod$solution[vars$type == "z"] > 0.999]
  data.frame(group1 = paste(which(mod$solution[vars$type == "X"] > 0.999), collapse=", "),
             group2 = paste(which(mod$solution[vars$type == "Y"] > 0.999), collapse=", "),
             selectedItems = paste(sel, collapse=", "),
             TPR = mean(A %in% sel),
             FPR = mean(setdiff(I, A) %in% sel))
}

opt.sets <- function(A, sets) {
  unique(do.call(rbind, lapply(seq(0, 1, .01),
                               function(alpha) opt.with.alpha(A, sets, alpha=alpha))))
}
Let's now consider a slightly more interesting example than the one presented in the problem:
- $A = \{1, 2, 3, 5, 7\}$
- $B = \{1, 6\}$
- $C = \{1, 5, 6\}$
- $D = \{1, 2, 4, 5\}$
- $E = \{1, 3, 4, 6, 8\}$
The code then provides an efficient frontier trading off the TPR and FPR:
A <- c(1, 2, 3, 5, 7)
B <- c(1, 6)
C <- c(1, 5, 6)
D <- c(1, 2, 4, 5)
E <- c(1, 3, 4, 6, 8)
opt.sets(A, list(B, C, D, E))
#   group1 group2       selectedItems TPR       FPR
# 1                                   0.0 0.0000000
# 2   1, 3      4                2, 5 0.4 0.0000000
# 4   2, 3      4                2, 5 0.4 0.0000000
# 8      3                 1, 2, 4, 5 0.6 0.3333333
# 9   3, 4        1, 2, 3, 4, 5, 6, 8 0.8 1.0000000
From the output, we see that beyond the trivial empty selection there are three achievable efficient solutions:
- TPR=0.4, FPR=0: We can achieve this with $(B\cup D)\setminus E$ or $(C\cup D)\setminus E$, in both cases obtaining items $\{2, 5\}$.
- TPR=0.6, FPR=0.333: We can achieve this with $D = \{1, 2, 4, 5\}$.
- TPR=0.8, FPR=1: We can achieve this, for instance, with $D\cup E = \{1, 2, 3, 4, 5, 6, 8\}$.
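Because this instance is tiny, the solver's frontier can be cross-checked by brute force: enumerate all $3^4 = 81$ ways of assigning each set to group 1, group 2, or neither, and keep the Pareto-efficient (TPR, FPR) pairs. A sketch in Python (used here only because the check needs no solver; values rounded to 3 decimals for display):

```python
from itertools import product

# The example data from above (sets listed in order B, C, D, E).
A = {1, 2, 3, 5, 7}
sets = [{1, 6}, {1, 5, 6}, {1, 2, 4, 5}, {1, 3, 4, 6, 8}]
universe = A.union(*sets)

# Enumerate every assignment: 0 = unused, 1 = group 1, 2 = group 2.
points = set()
for assign in product((0, 1, 2), repeat=len(sets)):
    g1 = set().union(*[s for s, a in zip(sets, assign) if a == 1])
    g2 = set().union(*[s for s, a in zip(sets, assign) if a == 2])
    chosen = g1 - g2
    tpr = len(chosen & A) / len(A)
    fpr = len(chosen - A) / len(universe - A)
    points.add((round(tpr, 3), round(fpr, 3)))

# Keep only the Pareto-efficient points (higher TPR, lower FPR).
frontier = sorted(p for p in points
                  if not any((q[0] >= p[0] and q[1] < p[1]) or
                             (q[0] > p[0] and q[1] <= p[1]) for q in points))
print(frontier)  # -> [(0.4, 0.0), (0.6, 0.333), (0.8, 1.0)]
```

This reproduces the three nontrivial points reported by the MILP, confirming that the weighted-sum sweep found the whole frontier in this instance.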
Best Answer
The problem you are trying to solve is a variant of the two-way partitioning problem (p. 234 of Boyd and Vandenberghe's Convex Optimization, p. 730 of the PDF at http://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf), which is known to be NP-hard.
In layman's terms, this means that when $A$ and $B$ are large, the problem essentially cannot be solved exactly. At best, one can hope for approximation algorithms that come with guarantees of the form "we are off by a factor of at most $k$".
In these cases, the details become important, so I would recommend against coding up such algorithms yourself: a hand-rolled implementation will likely be very inefficient (and maybe even way off).
The library is not the issue here; the functional form of $f$ can dramatically change the approach.