MATLAB: What is the way to get an optimum or near optimum combination of different columns (contains 1 and 0 only) to be equal to particular column (In the case, it’s all 1 column)

columncombinationknapsacklinear programmingoptimization

Input = [1 0 0 0; 0 1 1 0; 0 0 0 1; 1 0 0 0; 0 1 0 1; 0 0 1 0; 0 0 1 0; 1 0 0 0; 0 1 0 1] I want the columns which, when combined, should go as close as they can get to a column of ones. Here, if we combine the first, third and fourth column I get a column of ones. What is the algorithm to do this for bigger matrices (120 x 70)?

Best Answer

If you knew an exact solution existed, then intlinprog would be perfect.
By the way, it is a bad idea to use the name Input for a variable name, even if you capitalize the name. It will cause bugs in your code one day, when you mistake and use a lower case i there. So I'll call it something else.
Aeq = [1 0 0 0; 0 1 1 0; 0 0 0 1; 1 0 0 0; 0 1 0 1; 0 0 1 0; 0 0 1 0; 1 0 0 0; 0 1 0 1];
beq = ones(9,1);
So we want to find a vector X, such that
Aeq*X = beq
The requirement is that X must be a vector of integers, either 0 or 1.
A problem may be that there are multiple solutions. So we will use an objective that says that the vector X must have as small a sum as possible. That is what f will do here:
f = ones(4,1);
X = intlinprog(f,1:4,[],[],Aeq,beq,zeros(4,1),ones(4,1))
LP: Optimal objective value is 3.000000.
Optimal solution found.
Intlinprog stopped at the root node because the objective value is within a gap tolerance of the optimal value, options.AbsoluteGapTolerance = 0 (the default value). The
intcon variables are integer within tolerance, options.IntegerTolerance = 1e-05 (the default value).
X =
1
0
1
1
find(X)
ans =
1
3
4
So no problem at all, as long as a solution does exist.
But what if there is no exact solution? Then intlinprog will fail, because there will be no feasible solution found. In fact, this is the case I would usually expect when Aeq has more rows than columns. So the above solution will generally be inadequate. Instead, we want to solve a different problem. That is, solve for the vector X such that
sum(abs(Aeq*X - beq))
is minimized, where X is constrained to be a binary vector as above. We might think we want to solve this as a least squares problem, so minimizing this 2-norm:
norm(Aeq*X - beq)
The problem is, INTLINPROG does not solve this problem, and LSQLIN does not handle integer constraints. So, suppose we had this array?
Aeq = double(rand(120,70) < 0.05);
beq = ones(120,1);
As you can see, lsqlin fails.
Xr = lsqlin(Aeq,beq,[],[],[],[],zeros(70,1),ones(70,1));
Minimum found that satisfies the constraints.
Optimization completed because the objective function is non-decreasing in
feasible directions, to within the default value of the optimality tolerance,
and constraints are satisfied to within the default value of the constraint tolerance.
<stopping criteria details>
>> Xr
Xr =
0.043631
0.3165
0.4857
0.2582
0.47377
0.16141
0.22724
0.50588
0.72435
...
LSQLIN does not understand that the solution vector must be binary, even if I constrain it to lie in [0,1].
So the trick is to solve the minimum sum of absolute values problem using INTLINPROG. We can do that using slack variables. Essentially, we create a whole slew of extra variables, two new variables for each row of Aeq. (I think I explained this in my optimization tips and tricks doc, found on the File Exchange.)
[neq,nvar] = size(Aeq);
f = [zeros(nvar,1);ones(neq*2,1)];
LB = [zeros(nvar,1);zeros(2*neq,1)];
UB = [ones(nvar,1);inf(2*neq,1)];
Anew = [Aeq,eye(neq),-eye(neq)];
intcon = 1:nvar;
Ok, so now how does intlinprog do?
Xb = intlinprog(f,intcon,[],[],Anew,beq,LB,UB);
LP: Optimal objective value is 24.757627.
Cut Generation: Applied 6 Gomory cuts.
Lower bound is 26.302995.
Heuristics: Found 4 solutions using total rounding.
Upper bound is 56.000000.
Relative gap is 52.10%.

Heuristics: Found 2 solutions using rss.
Upper bound is 36.000000.
Relative gap is 26.20%.
Branch and Bound:
nodes total num int integer relative
explored time (s) solution fval gap (%)
27 0.13 7 3.300000e+01 1.146025e+01
74 0.17 8 3.200000e+01 3.281596e+00
135 0.20 8 3.200000e+01 0.000000e+00
Optimal solution found.
Intlinprog stopped because the objective value is within a gap tolerance of the optimal value, options.AbsoluteGapTolerance = 0 (the default value). The intcon variables
are integer within tolerance, options.IntegerTolerance = 1e-05 (the default value).
Well, it thinks it converged. The solution is just the first nvar elements of Xb. The slack variables are ignored now.
find(Xb(1:nvar))
ans =
6
8
9
10
12
13
22
26
27
34
37
42
44
49
53
56
57
63
66
69
sum(abs(Aeq*Xb(1:nvar) - beq))
ans =
32
So not terrible, but not great. Remember that we had 120 equations in 70 unknowns, but the unknowns are constrained to be binary. To convince you that a solution MAY be found if Aeq has fewer rows. So here 25 rows, with 75 columns.
Aeq = double(rand(25,75) < 0.1);
beq = ones(25,1);
[neq,nvar] = size(Aeq);
f = [zeros(nvar,1);ones(neq*2,1)];
LB = [zeros(nvar,1);zeros(2*neq,1)];
UB = [ones(nvar,1);inf(2*neq,1)];
Anew = [Aeq,eye(neq),-eye(neq)];
intcon = 1:nvar;
Xb = intlinprog(f,intcon,[],[],Anew,beq,LB,UB);
Did it work?
find(Xb(1:nvar))
ans =
9
11
15
19
21
32
43
55
71
sum(abs(Aeq*Xb(1:nvar) - beq))
ans =
3.7526e-14
So intlinprog was successful. And even when not successful, it will do the best it can to solve the L1 norm regression.