Here is a simple Julia script. If you translate it to another language, beware of the nested loops: Julia handles them efficiently, but they should be vectorized for Matlab or Python.
The first time the script is run it will create tab-separated-values (TSV) files for the $X$ and $W$ matrices. On subsequent runs, the script will read the TSV files, execute $k_{max}$ iterations, update the TSV files, and exit.
Thus you can intermittently refine the solution until you run out of patience.
#!/usr/bin/env julia
# Sequential Coordinate-wise algorithm for Non-Negative Least-Squares
# as described on pages 10-11 of
# http://users.wfu.edu/plemmons/papers/nonneg.pdf
#
# Convergence is painfully slow, but unlike most other NNLS
# algorithms the objective function is reduced at each step.
#
# The algorithm described in the PDF was modified from its
# original vector form: |Ax - b|²
# to the matrix form: |LXKᵀ - M|² + λ|X|²
#
# and to include the regularization term.
using LinearAlgebra, MAT, DelimitedFiles
function main()
    matfile = "problem.mat"
    Xfile = "problem.mat.X.tsv"
    Wfile = "problem.mat.W.tsv"

    # read the matrices from the Matlab file
    f = matopen(matfile)
    K = read(f,"K1"); println("K: size = $(size(K)),\t rank = $(rank(K))")
    L = read(f,"K2"); println("L: size = $(size(L)),\t rank = $(rank(L))")
    M = read(f, "M"); println("M: size = $(size(M)),\t rank = $(rank(M))")
    # S = read(f,"S00"); println("S: size = $(size(S)),\t rank = $(rank(S))")
    close(f)

    A = L'L
    B = K'K
    C = -L'M*K
    m,n = size(C)
    λ = 1/10     # regularization parameter
    kmax = 100   # maximum iterations

    # specify the size of the work arrays
    X = 0*C
    W = 1*C
    H = A[:,1] * B[:,1]'

    # resume from latest saved state ... or reset to initial conditions
    try
        X = readdlm(Xfile); println("X: size = $(size(X)), extrema = $(extrema(X))")
        W = readdlm(Wfile); println("W: size = $(size(W)), extrema = $(extrema(W))")
        println()
    catch
        @warn "Could not read the saved X,W matrices; re-initializing."
        X = 0*C
        W = 1*C
    end

    fxn = (norm(L*X*K' - M)^2 + λ*norm(X)^2) / 2
    println("at step 0, fxn = $fxn")

    k = 0
    while k < kmax
        for i = 1:m
            for j = 1:n
                mul!(H, A[:,i], B[:,j]')
                H[i,j] += λ
                δ = min( X[i,j], W[i,j]/H[i,j] )
                X[i,j] -= δ
                H .*= δ
                W .-= H
            end
        end
        k += 1
        fx2 = (norm(L*X*K' - M)^2 + λ*norm(X)^2) / 2
        println("after step $k, fxn = $fx2")
        # convergence check
        if fx2 ≈ fxn; break; end
        fxn = fx2
    end

    # save the current state for the next run
    writedlm(Xfile, X)
    writedlm(Wfile, W)

    # peek at the current solution
    println("\nsummary of current solution")
    println(" vector(X) = $(X[1:4]) ... $(X[end-3:end])")
    println("extrema(X) = $(extrema(X))")
end
# invoke the main function
main()
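If you want to try the script without the original problem.mat file, a small synthetic instance can be generated along the following lines. This is only an illustrative sketch: the sizes and noise level are made up, and only the variable names K1, K2, M are taken from the script above.

# build a small synthetic problem.mat (hypothetical sizes) for testing the script
using MAT
L = rand(50, 8)                         # will be read back as "K2"
K = rand(40, 6)                         # will be read back as "K1"
Xtrue = max.(randn(8, 6), 0)            # a non-negative ground truth
M = L * Xtrue * K' .+ 0.01 .* randn(50, 40)
matwrite("problem.mat", Dict("K1" => K, "K2" => L, "M" => M))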
I mentioned in the comments that the problem looks like a quadratic program, and indeed it is. Here are the details. (There is a lot of computation, so there might be mistakes.)
For convenience, put $y = Ax$, which is a random vector in $\mathbb{R}^m$. You are trying to minimize $\text{Var}(y_1) + \dotsb + \text{Var}(y_m)$. Let's express this objective in terms of what's known.
We have a constraint that $\mathbb{E}[Ax] = b$. The covariance matrix of $y$ is then $\Sigma = \mathbb{E}[(y - b)(y - b)^T]$, and $\text{Var}(y_1) + \dotsb + \text{Var}(y_m)$ is exactly the sum of the diagonal entries of $\Sigma$, i.e., $\text{tr}(\Sigma)$.
Now let's simplify this even further.
\begin{align*}
(y - b)(y - b)^T & = yy^T - b y^T - yb^T + bb^T\\
& = Axx^T A^T - b x^T A^T - Ax b^T + bb^T
\end{align*}
The covariance matrix is
\begin{align*}
\Sigma & = \mathbb{E}[(y - b)(y - b)^T]\\
& = \mathbb{E}[Axx^T A^T] - \mathbb{E}[bx^T A^T] - \mathbb{E}[Axb^T] + bb^T
\end{align*}
and
\begin{align*}
\text{tr}(\Sigma) & = \text{tr}(\mathbb{E}[Axx^T A^T]) - \text{tr}(\mathbb{E}[bx^T A^T]) - \text{tr}(\mathbb{E}[Axb^T]) + \text{tr}(bb^T)
\end{align*}
For the first term, we use two tricks: (1) the trace commutes with expectation; (2) the trace is cyclically invariant. We get
\begin{align*}
\text{tr}(\mathbb{E}[Axx^T A^T]) & = \mathbb{E}[\text{tr}(Axx^T A^T)]\\
& = \mathbb{E}[\text{tr}(A^T A xx^T)]\\
& = \text{tr}(\mathbb{E}[A^T A xx^T])\\
& = \text{tr}(\mathbb{E}[A^T A] xx^T)
\end{align*}
Since you know the mean and variance of $A$, $\mathbb{E}[A^T A]$ can be computed from this information. Let's put $M = \mathbb{E}[A^T A]$. Note that $M$ is a positive semidefinite matrix of size $n \times n$. Continuing, we have
\begin{align*}
\text{tr}(\mathbb{E}[Axx^T A^T]) & = \text{tr}(M xx^T)\\
& = \text{tr}(x^T M x)\\
& = x^T M x
\end{align*}
That takes care of the first term.
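As an aside on computing $M$: if, for example, the entries of $A$ are independent (an assumption about the model that is not stated above), then the known means and variances determine $M$ explicitly:
$$ M_{jk} = \sum_{i} \mathbb{E}[A_{ij} A_{ik}] = \sum_{i} \left( \mathbb{E}[A_{ij}]\, \mathbb{E}[A_{ik}] + \delta_{jk} \text{Var}(A_{ij}) \right), $$
that is, $M = \mathbb{E}[A]^T \mathbb{E}[A] + \text{diag}\left( \sum_i \text{Var}(A_{i1}), \dots, \sum_i \text{Var}(A_{in}) \right)$. If the entries of $A$ are correlated, the corresponding cross-covariances enter the off-diagonal entries as well.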
For the second term, put $F = \mathbb{E}[A] \in \mathbb{R}^{m \times n}$, which is known. We get
\begin{align*}
- \text{tr}(\mathbb{E}[b x^T A^T]) & = - \text{tr}(b x^T F^T)\\
& = - \text{tr}(x^T F^T b)\\
& = - x^T F^T b
\end{align*}
The third term works the same way: $-\text{tr}(\mathbb{E}[Axb^T]) = -\text{tr}(Fxb^T) = -b^T F x = -x^T F^T b$, so the second and third terms together contribute $-2x^T F^T b$.
So in the end, our objective $\text{Var}(y_1) + \dotsb + \text{Var}(y_m)$ simplifies to
$$ x^T M x - 2 x^T F^T b + \text{tr}(bb^T) $$
This is just a quadratic expression in $x$. Moreover, since $M$ is positive semidefinite, this is a convex quadratic expression.
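As a quick sanity check of this expression, here is a short Monte Carlo sketch (made-up sizes and distribution: the entries of $A$ are taken to be independent with mean $F$ and common variance $\sigma^2$, so that $M = F^T F + \sigma^2 m I$ as in the earlier remark):

# numerical sanity check of the quadratic expression (hypothetical sizes/distribution)
using LinearAlgebra, Statistics
m, n = 4, 6
F = rand(m, n)                        # plays the role of E[A]
σ = 0.1                               # assumed standard deviation of each entry of A
x = rand(n)
b = F * x                             # chosen so that E[Ax] = b holds
M = F'F + σ^2 * m * I(n)              # E[A'A] for independent entries
quad = x' * M * x - 2 * x' * F' * b + tr(b * b')
samples = [(F .+ σ .* randn(m, n)) * x for _ in 1:200_000]
mc = sum(var([y[i] for y in samples]) for i in 1:m)   # Var(y_1) + ... + Var(y_m)
println("quadratic expression = $quad,  Monte Carlo estimate = $mc")

The two printed numbers should agree to within Monte Carlo error.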
The whole problem boils down to the following quadratic program:
\begin{align*}
\underset{x \in \mathbb{R}^n}{\text{minimize}} & \quad x^T M x - 2x^T F^T b + \text{tr}(bb^T)\\
\text{subject to} & \quad F x = b\\
& \quad x \geq 0,
\end{align*}
where $M = \mathbb{E}[A^T A] \in \mathbb{R}^{n \times n}$, $F = \mathbb{E}[A] \in \mathbb{R}^{m \times n}$. Pretty much any convex optimization package can handle it. (For example, check out CVXPY.)
Update: As fedja pointed out in the comments, the problem simplifies further. Since $Fx = b$, the term $-2x^T F^T b$ is just $-2b^T b$, which is a constant. For optimization purposes, we can drop it, as well as the last term $\text{tr}(bb^T)$, from the objective. In the end, we just have
\begin{align*}
\underset{x \in \mathbb{R}^n}{\text{minimize}} & \quad x^T M x\\
\text{subject to} & \quad F x = b\\
& \quad x \geq 0.
\end{align*}
It's still a quadratic program.
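For what it's worth, here is a minimal sketch of this quadratic program in Julia, to match the rest of this post (the text above suggests CVXPY; this sketch uses JuMP with the Ipopt solver instead, and assumes $M$, $F$ and $b$ are already available as plain arrays):

using JuMP, Ipopt
function solve_qp(M, F, b)
    n = size(M, 2)
    model = Model(Ipopt.Optimizer)
    set_silent(model)
    @variable(model, x[1:n] >= 0)        # x ≥ 0
    @constraint(model, F * x .== b)      # Fx = b
    @objective(model, Min, x' * M * x)   # minimize x'Mx
    optimize!(model)
    return value.(x)
end

Any other QP-capable solver can be substituted for Ipopt here.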
Best Answer
The problem can be restated in the following equivalent form:
$$ \begin{alignat*}{3} \arg \min_{x} & \quad & \left\| A \boldsymbol{x} - \hat{\boldsymbol{b}} \right\|_{2}^{2} \\ \text{subject to} & \quad & {x}_{i} \geq 0 \\ & \quad & \boldsymbol{x}^{T} \boldsymbol{1} = 1 \end{alignat*} $$
Where $ A = \left[ \boldsymbol{a}_{1}, \boldsymbol{a}_{2}, \ldots \right] $ and $ \hat{\boldsymbol{b}} = N \boldsymbol{b} $.
This is basically non-negative least squares with a linear equality constraint.
You could easily solve it with many constrained least-squares solvers.
You could also use projected gradient descent with orthogonal projection onto the unit simplex, plus some FISTA-like acceleration, which gives a low-memory and fairly fast solver even for large problems.
When using projected gradient descent, pre-compute $ {A}^{T} A $ and $ {A}^{T} \hat{\boldsymbol{b}} $; the per-iteration cost is then quite small. A sketch of this approach is given below.
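For concreteness, here is a minimal sketch of that approach, using plain projected gradient descent without the FISTA-style acceleration; the helper names (project_simplex, pgd_simplex) and the fixed iteration count are made up for illustration.

using LinearAlgebra

# Euclidean projection onto the unit simplex {x : x >= 0, sum(x) = 1} (sort-based)
function project_simplex(v)
    u = sort(v, rev = true)
    css = cumsum(u)
    ρ = findlast(k -> u[k] + (1 - css[k]) / k > 0, 1:length(v))
    τ = (1 - css[ρ]) / ρ
    return max.(v .+ τ, 0)
end

# plain projected gradient descent for min (1/2)|A x - bhat|^2 over the unit simplex
function pgd_simplex(A, bhat; iters = 1_000)
    AtA = A' * A                  # pre-computed once, as suggested above
    Atb = A' * bhat
    step = 1 / opnorm(AtA)        # 1 / Lipschitz constant of the gradient
    x = fill(1 / size(A, 2), size(A, 2))
    for _ in 1:iters
        g = AtA * x - Atb         # gradient of (1/2)|A x - bhat|^2
        x = project_simplex(x - step .* g)
    end
    return x
end

Calling pgd_simplex(A, bhat) returns a feasible point of the simplex; adding the acceleration step would only change the update inside the loop.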