Regression Significance – How to Calculate P-Values in Constrained (Non-Negative) Least Squares Regression

constrained regression · least squares · regression · statistical significance

I have been using Matlab to perform unconstrained least squares (ordinary least squares), and it automatically outputs the coefficients, test statistics and p-values.

My question is: when I perform constrained least squares (strictly non-negative coefficients), it only outputs the coefficients, without test statistics or p-values.

Is it possible to calculate these values to assess significance? And why is this not directly available in the software (or any other software, for that matter)?

Best Answer

Solving a non-negative least squares (NNLS) problem is based on an iterative algorithm rather than a closed-form expression, which makes it different from regular least squares.

Algebraic expression for standard error (does not work)

With regular least squares you can compute p-values by using a t-test in combination with an estimate of the variance of the coefficients.

The expression for the sample variance of the estimate of the coefficients $\hat\theta$ is $$Var(\hat\theta) = \sigma^2(X^TX)^{-1}$$ The variance of the errors $\sigma^2$ will generally be unknown, but it can be estimated from the residuals. This expression can be derived algebraically starting from the expression for the coefficients in terms of the measurements $y$:

$$\hat\theta = (X^TX)^{-1} X^T y$$

This derivation implicitly assumes that $\hat\theta$ can take any value, negative ones included, so it breaks down when the coefficients are restricted to be non-negative.
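For concreteness, here is a minimal MATLAB sketch of how these quantities follow from the formulas above. The variable names are hypothetical, it assumes an $n \times p$ design matrix `X` and an $n \times 1$ response `y` are already in the workspace, and `tcdf` requires the Statistics and Machine Learning Toolbox. This is exactly the computation that is no longer valid once the coefficients are constrained:

```matlab
% Minimal OLS sketch (hypothetical names; assumes X is an n-by-p design
% matrix and y an n-by-1 response already in the workspace).
theta_hat  = (X'*X) \ (X'*y);             % coefficients, (X'X)^{-1} X'y
res        = y - X*theta_hat;             % residuals
n = numel(y);  p = size(X, 2);
sigma2_hat = sum(res.^2) / (n - p);       % estimate of the error variance sigma^2
cov_theta  = sigma2_hat * inv(X'*X);      % Var(theta_hat) = sigma^2 (X'X)^{-1}
se         = sqrt(diag(cov_theta));       % standard errors
tstat      = theta_hat ./ se;             % t statistics
pval       = 2 * tcdf(-abs(tstat), n-p);  % two-sided p-values (Statistics Toolbox)
```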

Inverse of Fisher information matrix (not applicable)

The distribution of the estimate of the coefficients also asymptotically approaches a normal distribution whose variance is the inverse of the observed Fisher information matrix:

$$(\hat\theta-\theta) \xrightarrow{d} N(0,\mathcal{I}(\hat\theta)^{-1})$$

But I am not sure whether this applies well here: the NNLS estimate is not unbiased, and the non-negativity constraint means its sampling distribution need not be asymptotically normal.

Monte Carlo Method

Whenever the expressions become too complicated, you can use a computational method to estimate the error. With the Monte Carlo method you simulate the randomness of the experiment by simulating repetitions of it (generating new data from the fitted model and refitting), and from these repetitions you estimate the variance of the coefficients.

What you could do is take the observed estimates of the model coefficients $\hat\theta$ and the residual variance $\hat\sigma^2$, generate new data from them (a couple of thousand repetitions, depending on how much precision you need), and refit the model each time. The spread of the refitted coefficients shows you their distribution, and from its variation you get an estimate of the error of the coefficients. (There are also more sophisticated schemes for performing this kind of modelling.)
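A minimal MATLAB sketch of this parametric Monte Carlo scheme, assuming `X` and `y` are in the workspace and using the built-in `lsqnonneg` for the NNLS fits; the number of repetitions `B` is an illustrative choice, and `prctile` needs the Statistics and Machine Learning Toolbox:

```matlab
% Parametric Monte Carlo sketch for NNLS (hypothetical names; assumes X
% and y are in the workspace; lsqnonneg is built into MATLAB).
theta_hat = lsqnonneg(X, y);                   % NNLS point estimate
res = y - X*theta_hat;
n = numel(y);  p = size(X, 2);
sigma_hat = sqrt(sum(res.^2) / (n - p));       % residual sd (the df correction
                                               % is only approximate under constraints)
B = 2000;                                      % number of simulated repetitions
theta_sim = zeros(B, p);
for b = 1:B
    y_sim = X*theta_hat + sigma_hat*randn(n, 1);  % new data from the fitted model
    theta_sim(b, :) = lsqnonneg(X, y_sim).';      % refit on the simulated data
end
se_mc = std(theta_sim);                        % Monte Carlo standard errors
ci_mc = prctile(theta_sim, [2.5 97.5]);        % 95% intervals (Statistics Toolbox)
```

Note that for coefficients whose true value is at or near zero, the simulated estimates pile up at the boundary, which is exactly why the usual normal-theory t-test is inappropriate here.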
