Solved – Non-negative least squares with minimal collinearity

least squares, multicollinearity, multiple regression, optimization, regression

I am trying to fit a dataset using the standard NNLS (non-negative least squares) approach.
Formally:

$\min_x ||Ax-b||^2_2$ s.t. $x\ge0$

This is a quadratic program and can be solved optimally. The solution I find fits the data reasonably well and is relatively sparse (which is good for me), but it has an undesirable property: it may assign high weights to features (columns of $A$) that are highly correlated. I would like the correlation between support features (features that have high weights) to be minimal. Note that in my setting all entries of $A$ and $b$ are non-negative, and I can normalize the columns of $A$ so that each column has unit norm.
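For reference, a minimal sketch of the baseline NNLS fit (SciPy assumed; `A` and `b` here are synthetic stand-ins for the real data):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Hypothetical non-negative data; in practice A and b come from the dataset.
A = np.abs(rng.standard_normal((100, 20)))
A /= np.linalg.norm(A, axis=0)          # normalize each column to unit norm
b = np.abs(rng.standard_normal(100))

# Standard NNLS: min ||Ax - b||_2^2  s.t.  x >= 0
x, rnorm = nnls(A, b)
support = np.flatnonzero(x > 1e-8)      # features with nonzero weight
print("support:", support, "residual norm:", rnorm)
```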

I tried approaching the problem by directly minimizing possible formalizations of the quantity I want. Note that with unit-norm columns, $A^TA$ is the Gram matrix whose off-diagonal entries are the pairwise similarities between features, so it could potentially be a good thing to minimize (perhaps not the diagonal, though). But minimizing $||Ax-b||^2_2+x^TA^TAx$ does not make sense, because $x^TA^TAx=||Ax||^2_2$, so that penalty just shrinks the fitted values toward zero. And minimizing $||Ax-b||^2_2+x^T(A^TA-I)x$ makes the optimization non-convex, because $A^TA-I$ can have negative eigenvalues. With this approach, I can see that intuitively this is a bit like doing the opposite of ridge regression/Tikhonov regularization.
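To see the non-convexity concretely: the Hessian of that second objective is $2(2A^TA-I)$, which is indefinite whenever the smallest eigenvalue of $A^TA$ falls below $1/2$, exactly the regime of highly correlated columns. A quick numerical check (a sketch on synthetic data; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a matrix with highly correlated unit-norm columns.
base = np.abs(rng.standard_normal(100))
A = base[:, None] + 0.05 * np.abs(rng.standard_normal((100, 20)))
A /= np.linalg.norm(A, axis=0)

# Objective ||Ax-b||^2 + x^T (A^T A - I) x has Hessian 2*(2*A^T A - I);
# a negative eigenvalue confirms the problem is non-convex.
G = A.T @ A
eigs = np.linalg.eigvalsh(2 * (2 * G - np.eye(G.shape[0])))
print("smallest Hessian eigenvalue:", eigs.min())   # negative here
```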

I also tried L1 regularization ($||Ax-b||^2_2+\lambda||x||_1$) just for the heck of it, but it doesn't solve this problem.
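For completeness, a sketch of that L1-regularized variant: under the constraint $x\ge0$ the penalty $\lambda||x||_1$ reduces to the linear term $\lambda\sum_i x_i$, so a bound-constrained solver handles it directly (SciPy assumed; `l1_nnls` is my own illustrative helper):

```python
import numpy as np
from scipy.optimize import minimize

def l1_nnls(A, b, lam):
    """Sketch: min ||Ax - b||_2^2 + lam * ||x||_1  s.t.  x >= 0.
    Since x >= 0, the L1 penalty is simply lam * sum(x)."""
    n = A.shape[1]

    def obj(x):
        r = A @ x - b
        return r @ r + lam * x.sum()

    def grad(x):
        # Gradient: 2 A^T (Ax - b) + lam * 1
        return 2 * A.T @ (A @ x - b) + lam

    res = minimize(obj, np.zeros(n), jac=grad,
                   method="L-BFGS-B", bounds=[(0, None)] * n)
    return res.x

# usage (with A, b as above): x = l1_nnls(A, b, lam=0.1)
```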

Formally, I am looking to modify the objective function so that it penalizes solutions in which features that are "similar" to each other both get high weights. There are several ways to formalize this notion; one is that I would like the support columns to be as orthogonal as possible.

Does anyone have ideas on how else to approach this?

Best Answer

I think minimizing $\| Ax -y \|^2 + \lambda x ^\top A^\top Ax$ does make sense. Think of it as a Gaussian prior with a particular non-diagonal covariance. Then you can vary $\lambda$ (and cross-validate) to achieve different error/feature-support tradeoffs.
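One way to implement this suggestion (my own reduction, not part of the answer): since $\lambda x^\top A^\top A x = ||\sqrt{\lambda}\,Ax||^2_2$, the penalized problem is an ordinary NNLS on a stacked system, so a standard NNLS solver applies directly. A sketch with SciPy:

```python
import numpy as np
from scipy.optimize import nnls

def penalized_nnls(A, b, lam):
    """Sketch: min ||Ax - b||_2^2 + lam * x^T A^T A x  s.t.  x >= 0.
    Because lam * x^T A^T A x = ||sqrt(lam) * A x - 0||_2^2, the penalty
    can be stacked as extra rows and solved as a plain NNLS."""
    A_aug = np.vstack([A, np.sqrt(lam) * A])
    b_aug = np.concatenate([b, np.zeros(len(b))])
    x, _ = nnls(A_aug, b_aug)
    return x

# Vary lam over a grid (with cross-validation) to trade off fit error
# against the penalty on correlated support features.
```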