Here is a simple Julia script. If you translate it to another language, beware of the nested loops: Julia handles them efficiently, but they should be vectorized for Matlab or Python.
The first time the script is run it will create tab-separated-values (TSV) files for the $X$ and $W$ matrices. On subsequent runs, the script will read the TSV files, execute $k_{max}$ iterations, update the TSV files, and exit.
Thus you can intermittently refine the solution until you run out of patience.
#!/usr/bin/env julia
# Sequential Coordinate-wise algorithm for Non-Negative Least-Squares
# as described on pages 10-11 of
# http://users.wfu.edu/plemmons/papers/nonneg.pdf
#
# Convergence is painfully slow, but unlike most other NNLS
# algorithms the objective function is reduced at each step.
#
# The algorithm described in the PDF was modified from its
# original vector form: |Ax - b|²
# to the matrix form: |LXKᵀ - M|² + λ|X|²
#
# and to include the regularization term.
using LinearAlgebra, MAT, DelimitedFiles
function main()
    matfile = "problem.mat"
    Xfile = "problem.mat.X.tsv"
    Wfile = "problem.mat.W.tsv"

    # read the matrices from the Matlab file
    f = matopen(matfile)
    K = read(f,"K1"); println("K: size = $(size(K)),\t rank = $(rank(K))")
    L = read(f,"K2"); println("L: size = $(size(L)),\t rank = $(rank(L))")
    M = read(f, "M"); println("M: size = $(size(M)),\t rank = $(rank(M))")
    # S = read(f,"S00"); println("S: size = $(size(S)),\t rank = $(rank(S))")
    close(f)

    A = L'L
    B = K'K
    C = -L'M*K
    m,n = size(C)
    λ = 1/10     # regularization parameter
    kmax = 100   # maximum iterations

    # specify the size of the work arrays
    X = 0*C
    W = 1*C
    H = A[:,1] * B[:,1]'

    # resume from latest saved state ... or reset to initial conditions
    try
        X = readdlm(Xfile); println("X: size = $(size(X)), extrema = $(extrema(X))")
        W = readdlm(Wfile); println("W: size = $(size(W)), extrema = $(extrema(W))")
        println()
    catch
        @warn "Could not read the saved X,W matrices; re-initializing."
        X = 0*C
        W = 1*C
    end

    fxn = (norm(L*X*K' - M)^2 + λ*norm(X)^2) / 2
    println("at step 0, fxn = $fxn")

    k = 0
    while k < kmax
        for i = 1:m
            for j = 1:n
                mul!(H, A[:,i], B[:,j]')
                H[i,j] += λ
                δ = min( X[i,j], W[i,j]/H[i,j] )
                X[i,j] -= δ
                H .*= δ
                W .-= H
            end
        end
        k += 1
        fx2 = (norm(L*X*K' - M)^2 + λ*norm(X)^2) / 2
        println("after step $k, fxn = $fx2")
        # convergence check
        if fx2 ≈ fxn; break; end
        fxn = fx2
    end

    # save the current state for the next run
    writedlm(Xfile, X)
    writedlm(Wfile, W)

    # peek at the current solution
    println("\nsummary of current solution")
    println(" vector(X) = $(X[1:4]) ... $(X[end-3:end])")
    println("extrema(X) = $(extrema(X))")
end
# invoke the main function
main()
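If you want to try the script without the original problem.mat file, a small synthetic instance can be generated along the following lines. This is only an illustrative sketch: the sizes and noise level are made up, and only the variable names K1, K2, M are taken from the script above.

# build a small synthetic problem.mat (hypothetical sizes) for testing the script
using MAT
L = rand(50, 8)                         # will be read back as "K2"
K = rand(40, 6)                         # will be read back as "K1"
Xtrue = max.(randn(8, 6), 0)            # a non-negative ground truth
M = L * Xtrue * K' .+ 0.01 .* randn(50, 40)
matwrite("problem.mat", Dict("K1" => K, "K2" => L, "M" => M))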
I mentioned in the comments that the problem looks like a quadratic program, and indeed it is. Here are the details. (There is a lot of computation, so there might be mistakes.)
For convenience, put $y = Ax$, which is a random vector in $\mathbb{R}^m$. You are trying to minimize $\text{Var}(y_1) + \dotsb + \text{Var}(y_m)$. Let's express this objective in terms of what's known.
We have a constraint that $\mathbb{E}[Ax] = b$. The covariance matrix of $y$ is then $\Sigma = \mathbb{E}[(y - b)(y - b)^T]$, and $\text{Var}(y_1) + \dotsb + \text{Var}(y_m)$ is exactly the sum of the diagonal entries of $\Sigma$, i.e., $\text{tr}(\Sigma)$.
Now let's simplify this even further.
\begin{align*}
(y - b)(y - b)^T & = yy^T - b y^T - yb^T + bb^T\\
& = Axx^T A^T - b x^T A^T - Ax b^T + bb^T
\end{align*}
The covariance matrix is
\begin{align*}
\Sigma & = \mathbb{E}[(y - b)(y - b)^T]\\
& = \mathbb{E}[Axx^T A^T] - \mathbb{E}[bx^T A^T] - \mathbb{E}[Axb^T] + bb^T
\end{align*}
and
\begin{align*}
\text{tr}(\Sigma) & = \text{tr}(\mathbb{E}[Axx^T A^T]) - \text{tr}(\mathbb{E}[bx^T A^T]) - \text{tr}(\mathbb{E}[Axb^T]) + \text{tr}(bb^T)
\end{align*}
For the first term, we use two tricks: (1) the trace commutes with expectation; (2) the trace is cyclically invariant. We get
\begin{align*}
\text{tr}(\mathbb{E}[Axx^T A^T]) & = \mathbb{E}[\text{tr}(Axx^T A^T)]\\
& = \mathbb{E}[\text{tr}(A^T A xx^T)]\\
& = \text{tr}(\mathbb{E}[A^T A xx^T])\\
& = \text{tr}(\mathbb{E}[A^T A] xx^T)
\end{align*}
Since you know the mean and variance of $A$, $\mathbb{E}[A^T A]$ can be computed from this information. Let's put $M = \mathbb{E}[A^T A]$. Note that $M$ is a positive semidefinite matrix of size $n \times n$. Continuing, we have
\begin{align*}
\text{tr}(\mathbb{E}[Axx^T A^T]) & = \text{tr}(M xx^T)\\
& = \text{tr}(x^T M x)\\
& = x^T M x
\end{align*}
That takes care of the first term.
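As an aside on computing $M$: if, for example, the entries of $A$ are independent (an assumption about the model that is not stated above), then the known means and variances determine $M$ explicitly:
$$ M_{jk} = \sum_{i} \mathbb{E}[A_{ij} A_{ik}] = \sum_{i} \left( \mathbb{E}[A_{ij}]\, \mathbb{E}[A_{ik}] + \delta_{jk} \text{Var}(A_{ij}) \right), $$
that is, $M = \mathbb{E}[A]^T \mathbb{E}[A] + \text{diag}\left( \sum_i \text{Var}(A_{i1}), \dots, \sum_i \text{Var}(A_{in}) \right)$. If the entries of $A$ are correlated, the corresponding cross-covariances enter the off-diagonal entries as well.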
For the second term, put $F = \mathbb{E}[A] \in \mathbb{R}^{m \times n}$, which is known. We get
\begin{align*}
- \text{tr}(\mathbb{E}[b x^T A^T]) & = - \text{tr}(b x^T F^T)\\
& = - \text{tr}(x^T F^T b)\\
& = - x^T F^T b
\end{align*}
The third term works the same way: $-\text{tr}(\mathbb{E}[Axb^T]) = -\text{tr}(Fxb^T) = -b^T F x = -x^T F^T b$, so the second and third terms together contribute $-2x^T F^T b$.
So in the end, our objective $\text{Var}(y_1) + \dotsb + \text{Var}(y_m)$ simplifies to
$$ x^T M x - 2 x^T F^T b + \text{tr}(bb^T) $$
This is just a quadratic expression in $x$. Moreover, since $M$ is positive semidefinite, this is a convex quadratic expression.
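As a quick sanity check of this expression, here is a short Monte Carlo sketch (made-up sizes and distribution: the entries of $A$ are taken to be independent with mean $F$ and common variance $\sigma^2$, so that $M = F^T F + \sigma^2 m I$ as in the earlier remark):

# numerical sanity check of the quadratic expression (hypothetical sizes/distribution)
using LinearAlgebra, Statistics
m, n = 4, 6
F = rand(m, n)                        # plays the role of E[A]
σ = 0.1                               # assumed standard deviation of each entry of A
x = rand(n)
b = F * x                             # chosen so that E[Ax] = b holds
M = F'F + σ^2 * m * I(n)              # E[A'A] for independent entries
quad = x' * M * x - 2 * x' * F' * b + tr(b * b')
samples = [(F .+ σ .* randn(m, n)) * x for _ in 1:200_000]
mc = sum(var([y[i] for y in samples]) for i in 1:m)   # Var(y_1) + ... + Var(y_m)
println("quadratic expression = $quad,  Monte Carlo estimate = $mc")

The two printed numbers should agree to within Monte Carlo error.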
The whole problem boils down to the following quadratic program:
\begin{align*}
\underset{x \in \mathbb{R}^n}{\text{minimize}} & \quad x^T M x - 2x^T F^T b + \text{tr}(bb^T)\\
\text{subject to} & \quad F x = b\\
& \quad x \geq 0,
\end{align*}
where $M = \mathbb{E}[A^T A] \in \mathbb{R}^{n \times n}$, $F = \mathbb{E}[A] \in \mathbb{R}^{m \times n}$. Pretty much any convex optimization package can handle it. (For example, check out CVXPY.)
Update: As fedja pointed out in the comments, the problem simplifies further. Since $Fx = b$, the term $-2x^T F^T b$ is just $-2b^T b$, which is a constant. For optimization purposes, we can drop it, as well as the last term $\text{tr}(bb^T)$, from the objective. In the end, we just have
\begin{align*}
\underset{x \in \mathbb{R}^n}{\text{minimize}} & \quad x^T M x\\
\text{subject to} & \quad F x = b\\
& \quad x \geq 0.
\end{align*}
It's still a quadratic program.
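For what it's worth, here is a minimal sketch of this quadratic program in Julia, to match the rest of this post (the text above suggests CVXPY; this sketch uses JuMP with the Ipopt solver instead, and assumes $M$, $F$ and $b$ are already available as plain arrays):

using JuMP, Ipopt
function solve_qp(M, F, b)
    n = size(M, 2)
    model = Model(Ipopt.Optimizer)
    set_silent(model)
    @variable(model, x[1:n] >= 0)        # x ≥ 0
    @constraint(model, F * x .== b)      # Fx = b
    @objective(model, Min, x' * M * x)   # minimize x'Mx
    optimize!(model)
    return value.(x)
end

Any other QP-capable solver can be substituted for Ipopt here.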
Best Answer
The problem can be restated in the following equivalent form:
$$ \begin{alignat*}{3} \arg \min_{x} & \quad & \left\| A \boldsymbol{x} - \hat{\boldsymbol{b}} \right\|_{2}^{2} \\ \text{subject to} & \quad & {x}_{i} \geq 0 \\ & \quad & \boldsymbol{x}^{T} \boldsymbol{1} = 1 \end{alignat*} $$
Where $ A = \left[ \boldsymbol{a}_{1}, \boldsymbol{a}_{2}, \ldots \right] $ and $ \hat{\boldsymbol{b}} = N \boldsymbol{b} $.
This is basically non-negative least squares with a linear equality constraint.
You could easily solve it with many constrained least-squares solvers.
You could also use projected gradient descent with orthogonal projection onto the unit simplex, plus some FISTA-like acceleration, which gives a low-memory and fairly fast solver even for large problems.
When using projected gradient descent, pre-compute $ {A}^{T} A $ and $ {A}^{T} \hat{\boldsymbol{b}} $; the per-iteration cost is then quite small. A sketch of this approach is given below.
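For concreteness, here is a minimal sketch of that approach, using plain projected gradient descent without the FISTA-style acceleration; the helper names (project_simplex, pgd_simplex) and the fixed iteration count are made up for illustration.

using LinearAlgebra

# Euclidean projection onto the unit simplex {x : x >= 0, sum(x) = 1} (sort-based)
function project_simplex(v)
    u = sort(v, rev = true)
    css = cumsum(u)
    ρ = findlast(k -> u[k] + (1 - css[k]) / k > 0, 1:length(v))
    τ = (1 - css[ρ]) / ρ
    return max.(v .+ τ, 0)
end

# plain projected gradient descent for min (1/2)|A x - bhat|^2 over the unit simplex
function pgd_simplex(A, bhat; iters = 1_000)
    AtA = A' * A                  # pre-computed once, as suggested above
    Atb = A' * bhat
    step = 1 / opnorm(AtA)        # 1 / Lipschitz constant of the gradient
    x = fill(1 / size(A, 2), size(A, 2))
    for _ in 1:iters
        g = AtA * x - Atb         # gradient of (1/2)|A x - bhat|^2
        x = project_simplex(x - step .* g)
    end
    return x
end

Calling pgd_simplex(A, bhat) returns a feasible point of the simplex; adding the acceleration step would only change the update inside the loop.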