Essentially, you are asking how to select two groups of sets so that the elements covered by the first group but not by the second are as similar as possible to some target set $A$. There are two competing objectives here:
- True positive rate (TPR): Proportion of all elements in set A that you obtain.
- False positive rate (FPR): Proportion of all elements not in set A that you obtain.
Just as in classification, it is easy to simultaneously maximize both rates (select every set into the first group and none into the second) or to drive both to 0 (select no sets into the first group); you're interested in intermediate solutions that achieve a high TPR together with a low FPR.
I would proceed using mathematical optimization, in particular mixed integer linear optimization. I will use the following notation for the input data:
- $A$: the target set
- $m_{ij}$: Whether item $i$ is included in set $j$ (1 or 0)
I will then define the following decision variables, all of which are binary (take values 1 or 0):
- $X_j$: is set $j$ selected in the first group?
- $Y_j$: is set $j$ selected in the second group?
- $x_i$: is item $i$ in the first group?
- $y_i$: is item $i$ in the second group?
- $z_i$: is item $i$ selected?
Based on these decision variables, it is easy to obtain formulas for the TPR and FPR that are linear in the decision variables:
\begin{align*}
TPR &= \frac{1}{|A|}\sum_{i\in A} z_i \\
FPR &= \frac{1}{|\bar A|}\sum_{i\notin A} z_i
\end{align*}
We can make the objective function be a convex combination of the TPR and the negative FPR:
$$
\max \alpha~TPR - (1-\alpha)~FPR
$$
By re-solving the optimization problem with $\alpha$ ranging from 0 to 1, we can trace out an efficient frontier trading off TPR against FPR. (As with any weighted-sum scalarization, this recovers only the supported points of the frontier, i.e., those on its convex hull.)
We now need a series of constraints to connect the $X_j$ and $Y_j$ decision variables (which sets are in groups 1 and 2) to the other decision variables.
A set is selected in at most one of the two groups:
$$
X_j + Y_j\leq 1~\forall j
$$
Connect the $x_i$ variables with the $X_j$ variables (the first constraint forces $x_i = 1$ whenever some selected set contains item $i$; the second forces $x_i = 0$ when no selected set does):
\begin{align*}
x_i &\geq m_{ij}X_j &\forall i,j \\
x_i &\leq \sum_j m_{ij}X_j &\forall i
\end{align*}
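These two inequalities are the standard linearization of a logical OR: together they pin the binary variable $x_i$ to $\max_j m_{ij}X_j$. A quick illustration in Python (with a toy membership matrix made up for this sketch; the actual implementation below is in R):

```python
def linked_x(m, X):
    # The value the constraints force: item i is covered iff some selected set contains it.
    return [int(any(mij and Xj for mij, Xj in zip(row, X))) for row in m]

# Toy data: 3 items, 2 sets; m[i][j] = 1 if item i belongs to set j.
m = [[1, 0],
     [0, 0],
     [1, 1]]
X = [0, 1]          # only the second set goes into group 1
x = linked_x(m, X)  # -> [0, 0, 1]

# Both constraint families hold at this x:
for i, row in enumerate(m):
    assert all(x[i] >= mij * Xj for mij, Xj in zip(row, X))  # x_i >= m_ij X_j for all j
    assert x[i] <= sum(mij * Xj for mij, Xj in zip(row, X))  # x_i <= sum_j m_ij X_j
```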
Similarly connect the $y_i$ variables with the $Y_j$ variables:
\begin{align*}
y_i &\geq m_{ij}Y_j &\forall i,j \\
y_i &\leq \sum_j m_{ij}Y_j &\forall i
\end{align*}
Connect $z_i$ to $x_i$ and $y_i$ (we want $z_i = x_i(1-y_i)$, but we do this a bit differently to avoid bilinear terms):
\begin{align*}
z_i &\geq x_i - y_i &\forall i \\
z_i &\leq x_i &\forall i \\
z_i &\leq 1-y_i &\forall i
\end{align*}
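Since all variables are binary, enumerating the four cases of $(x_i, y_i)$ confirms that these three inequalities leave exactly one feasible value for $z_i$, namely $x_i(1-y_i)$:

$$
\begin{array}{cc|ccc|c}
x_i & y_i & z_i \geq x_i - y_i & z_i \leq x_i & z_i \leq 1-y_i & \Rightarrow z_i \\
\hline
0 & 0 & z_i \geq 0 & z_i \leq 0 & z_i \leq 1 & 0 \\
0 & 1 & z_i \geq -1 & z_i \leq 0 & z_i \leq 0 & 0 \\
1 & 0 & z_i \geq 1 & z_i \leq 1 & z_i \leq 1 & 1 \\
1 & 1 & z_i \geq 0 & z_i \leq 1 & z_i \leq 0 & 0
\end{array}
$$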
There are a few different constraints here, so it's a bit of work to implement this; I've done it in R using the lpSolve package:
library(lpSolve)

opt.with.alpha <- function(A, sets, alpha) {
  # Index sets: I = all items appearing anywhere, J = the candidate sets
  I <- sort(unique(c(A, unlist(sets))))
  nI <- length(I)
  J <- seq_along(sets)
  nJ <- length(J)
  nIJ <- length(unlist(sets))  # total number of (item, set) membership pairs

  # One row per decision variable: X_j, Y_j, x_i, y_i, z_i
  vars <- rbind(data.frame(type="X", i=NA, j=J),
                data.frame(type="Y", i=NA, j=J),
                data.frame(type="x", i=I, j=NA),
                data.frame(type="y", i=I, j=NA),
                data.frame(type="z", i=I, j=NA))

  # x_i >= m_ij X_j (one constraint row per membership pair)
  xIJ <- matrix(0, nrow=nIJ, ncol=nrow(vars))
  xIJ[cbind(seq_len(nIJ), which(vars$type == "X")[rep(J, sapply(sets, length))])] <- -1
  xIJ[cbind(seq_len(nIJ), which(vars$type == "x")[match(unlist(sets), I)])] <- 1

  # x_i <= sum_j m_ij X_j (one constraint row per item)
  xI <- matrix(0, nrow=nI, ncol=nrow(vars))
  xI[cbind(seq_len(nI), which(vars$type == "x"))] <- 1
  xI[cbind(match(unlist(sets), I), which(vars$type == "X")[rep(J, sapply(sets, length))])] <- -1

  # The same two constraint families linking y_i with Y_j
  yIJ <- matrix(0, nrow=nIJ, ncol=nrow(vars))
  yIJ[cbind(seq_len(nIJ), which(vars$type == "Y")[rep(J, sapply(sets, length))])] <- -1
  yIJ[cbind(seq_len(nIJ), which(vars$type == "y")[match(unlist(sets), I)])] <- 1
  yI <- matrix(0, nrow=nI, ncol=nrow(vars))
  yI[cbind(seq_len(nI), which(vars$type == "y"))] <- 1
  yI[cbind(match(unlist(sets), I), which(vars$type == "Y")[rep(J, sapply(sets, length))])] <- -1

  # Full constraint matrix: X_j + Y_j <= 1, the four linking families above,
  # then z_i >= x_i - y_i, z_i <= x_i, z_i <= 1 - y_i
  con <- rbind(t(sapply(J, function(j) vars$type %in% c("X", "Y") & vars$j == j)),
               xIJ, xI, yIJ, yI,
               t(sapply(I, function(i) (vars$type == "z" & vars$i == i) -
                                       (vars$type == "x" & vars$i == i) +
                                       (vars$type == "y" & vars$i == i))),
               t(sapply(I, function(i) (vars$type == "z" & vars$i == i) -
                                       (vars$type == "x" & vars$i == i))),
               t(sapply(I, function(i) (vars$type == "z" & vars$i == i) +
                                       (vars$type == "y" & vars$i == i)))) * 1

  # Objective: alpha * TPR - (1 - alpha) * FPR, expressed through the z_i coefficients
  mod <- lp(direction="max",
            objective.in = ifelse(vars$type == "z" & vars$i %in% A, alpha/length(A),
                           ifelse(vars$type == "z", -(1-alpha)/(nI-length(A)), 0)),
            const.mat=con,
            const.dir=rep(c("<=", ">=", "<=", ">=", "<=", ">=", "<="),
                          c(nJ, nIJ, nI, nIJ, nI, nI, 2*nI)),
            const.rhs=rep(c(1, 0, 1), c(nJ, 2*nIJ+4*nI, nI)),
            all.bin = TRUE)

  # Map the z_i solution back to item labels via I (positions in the solution
  # vector need not coincide with item labels in general)
  sel <- I[mod$solution[vars$type == "z"] > 0.999]
  data.frame(group1 = paste(which(mod$solution[vars$type == "X"] > 0.999), collapse=", "),
             group2 = paste(which(mod$solution[vars$type == "Y"] > 0.999), collapse=", "),
             selectedItems = paste(sel, collapse=", "),
             TPR = mean(A %in% sel),
             FPR = mean(setdiff(I, A) %in% sel))
}

opt.sets <- function(A, sets) {
  unique(do.call(rbind, lapply(seq(0, 1, .01),
                               function(alpha) opt.with.alpha(A, sets, alpha=alpha))))
}
Let's now consider a slightly more interesting example than the one presented in the problem:
- $A = \{1, 2, 3, 5, 7\}$
- $B = \{1, 6\}$
- $C = \{1, 5, 6\}$
- $D = \{1, 2, 4, 5\}$
- $E = \{1, 3, 4, 6, 8\}$
The code then provides an efficient frontier trading off the TPR and FPR:
A <- c(1, 2, 3, 5, 7)
B <- c(1, 6)
C <- c(1, 5, 6)
D <- c(1, 2, 4, 5)
E <- c(1, 3, 4, 6, 8)
opt.sets(A, list(B, C, D, E))
#   group1 group2       selectedItems TPR       FPR
# 1                                   0.0 0.0000000
# 2   1, 3      4                2, 5 0.4 0.0000000
# 4   2, 3      4                2, 5 0.4 0.0000000
# 8      3                 1, 2, 4, 5 0.6 0.3333333
# 9   3, 4        1, 2, 3, 4, 5, 6, 8 0.8 1.0000000
From the output, we see that beyond the trivial empty selection there are three achievable efficient solutions:
- TPR=0.4, FPR=0: We can achieve this with $(B\cup D)\setminus E$ or $(C\cup D)\setminus E$, in both cases obtaining items $\{2, 5\}$.
- TPR=0.6, FPR=0.333: We can achieve this with $D = \{1, 2, 4, 5\}$.
- TPR=0.8, FPR=1: We can achieve this, for instance, with $D\cup E = \{1, 2, 3, 4, 5, 6, 8\}$.
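Because this instance is tiny, the solver's frontier can be cross-checked by brute force: enumerate all $3^4 = 81$ ways of assigning each set to group 1, group 2, or neither, and keep the Pareto-efficient (TPR, FPR) pairs. A sketch in Python (used here only because the check needs no solver; values rounded to 3 decimals for display):

```python
from itertools import product

# The example data from above (sets listed in order B, C, D, E).
A = {1, 2, 3, 5, 7}
sets = [{1, 6}, {1, 5, 6}, {1, 2, 4, 5}, {1, 3, 4, 6, 8}]
universe = A.union(*sets)

# Enumerate every assignment: 0 = unused, 1 = group 1, 2 = group 2.
points = set()
for assign in product((0, 1, 2), repeat=len(sets)):
    g1 = set().union(*[s for s, a in zip(sets, assign) if a == 1])
    g2 = set().union(*[s for s, a in zip(sets, assign) if a == 2])
    chosen = g1 - g2
    tpr = len(chosen & A) / len(A)
    fpr = len(chosen - A) / len(universe - A)
    points.add((round(tpr, 3), round(fpr, 3)))

# Keep only the Pareto-efficient points (higher TPR, lower FPR).
frontier = sorted(p for p in points
                  if not any((q[0] >= p[0] and q[1] < p[1]) or
                             (q[0] > p[0] and q[1] <= p[1]) for q in points))
print(frontier)  # -> [(0.4, 0.0), (0.6, 0.333), (0.8, 1.0)]
```

This reproduces the three nontrivial points reported by the MILP, confirming that the weighted-sum sweep found the whole frontier in this instance.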
Best Answer
The problem you are trying to solve is a variant of the two-way partitioning problem (p. 234 of Boyd and Vandenberghe's Convex Optimization, p. 730 of the PDF at http://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf), which is known to be NP-hard.
In layman's terms, this means that when $A$ and $B$ are large, the problem essentially cannot be solved exactly. At best, one can hope for approximation algorithms that come with guarantees of the form "we are off by a factor of at most $k$".
In these cases, the details become important, so I would recommend against coding up such algorithms yourself: a hand-rolled implementation will likely be very inefficient (and maybe even way off).
The library is not the issue here; the functional form of $f$ can dramatically change the approach.