Is it ever possible for least-squares linear regression (linear in both features and weights) NOT to produce a solution? That is, after we set each partial derivative to zero, can the resulting system of n linear equations in n weights be "inconsistent"? An example of "inconsistent system of linear equations" is two parallel lines (for 2 weights) or two parallel planes (for 3 weights). If the answer is yes, can you please give a simple numerical example with 2 or 3 data points?
Solved – Can least-squares linear regression ever produce no solution at all
linear model, machine learning, matrix, regression
Related Solutions
I don't use ridge regression all that much, so I'll focus on (A) and (C):
(A) While the Lasso is traditionally motivated by the p > n scenario, it is mathematically well-defined when n > p, i.e. the solution exists and is unique assuming your design matrix is sufficiently well-behaved. All the same formulas and error bounds continue to hold when n > p. All the algorithms (at least that I know of) that produce Lasso estimates should also work when n > p.
Most of the time, if n > p (especially if p is small), you probably want to think carefully about whether the Lasso is your best option. As usual, it is problem dependent. That being said, in some situations the Lasso may be appropriate when n > p: for example, if you have 10,000 predictors and 15,000 observations, you will likely still want some kind of regularization to trim down the number of predictors and kill some of the noise. The Lasso may be helpful here.
(B) Ridge regression can be used in the p > n situation to alleviate singularity issues in the design matrix. This may be useful if sparsity / feature selection is not important. Moreover, ridge regression has a very nice closed-form solution that is easily interpreted, and this can be helpful in practice. In essence, you add a positive constant to the main diagonal of $X^\top X$, which improves the conditioning of the sample covariance matrix (specifically, it lifts vanishing eigenvalues away from zero, provided enough regularization is applied).
(I'll leave it to the experts to address this one more thoughtfully.)
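As a minimal sketch (my own illustration, not from the original answer), the ridge closed form $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$ can be computed directly even when p > n; the data and $\lambda$ below are made up:

```python
import numpy as np

# Toy data with p > n, so X'X is singular and plain OLS has no unique solution.
rng = np.random.default_rng(0)
n, p = 10, 20
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

lam = 1.0  # regularization strength (made-up value for illustration)

# Ridge closed form: (X'X + lam*I) is positive definite for any lam > 0,
# so the solve succeeds even though X'X alone is singular here.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)  # (20,)
```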
(C) Soft thresholding and the Lasso are closely related, but not identical. One interpretation of soft thresholding is as the special case of Lasso regression when the predictors are orthogonal, which is of course a restrictive assumption.
Another interpretation of soft thresholding is as the one-at-a-time update in coordinate descent algorithms for the Lasso. I recommend the paper "Pathwise Coordinate Optimization" by Friedman et al. for an introduction to these concepts. For a slightly more recent and more general treatment, there is the excellent paper "SparseNet: Coordinate Descent With Nonconvex Penalties" by Mazumder et al.
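To make the connection concrete, here is a minimal sketch (my own, not the reference implementation from either paper) of the soft-thresholding operator $S(z, \lambda) = \mathrm{sign}(z)\,\max(|z| - \lambda, 0)$ and its role as the one-at-a-time coordinate update for the Lasso:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimal sketch of coordinate descent for
    (1/2n)||y - X b||^2 + lam * ||b||_1  (assumes no all-zero columns)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual that excludes coordinate j's contribution.
            r_j = y - X @ beta + X[:, j] * beta[j]
            # One-at-a-time update: univariate least squares, then soft-threshold.
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam) / (X[:, j] @ X[:, j] / n)
    return beta
```

With standardized predictors ($X_j^\top X_j / n = 1$) the denominator drops out and the update is exactly a soft-threshold; with orthogonal predictors a single pass already recovers the Lasso solution, which is the special case mentioned above.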
It is very good that you explicitly state your goal, i.e. "I want ... to understand the influence sea surface temperatures (x-axis) have on land temperature over a particular region (y-axis)". Too often this aspect is ignored in these sorts of questions!
First, as always it is important to understand that correlation does not imply causation.
Now, the two approaches to line fitting differ statistically in that OLS treats $x$ as "error free", while TLS (a.k.a. "errors in variables" linear regression) accounts for uncertainty in both $x$ and $y$. (These are treated symmetrically in the case of orthogonal least squares.)
The two approaches also differ in their goals: Orthogonal least squares is similar to PCA, and is essentially fitting a multivariate Gaussian joint distribution $p[x,y]$ to the data (in the 2D case, at least). Ordinary least squares is more oriented to fitting a set of conditional Gaussian distributions $p[y \vert x]$ to the data.
Now, since your $x$ and $y$ variables have the same units (both are temperatures) and similar ranges, orthogonal least squares is certainly reasonable. It is difficult to tell (given the large marker size, low transparency, and high density of overplotted points), but the TLS line appears to better capture the data as well.
A summary of the usefulness of the two approaches might be as follows:
- If your goal is to constrain the distribution of $y$ given a precise value of $x$, then the OLS curve is what you want. (For example $R^2$ gives the reduction in variance for $y|x$ vs. $y$).
- If your goal is to constrain the "independent components" of the 2D $(x,y)$ data, then TLS is better. For example, the first principal component may reflect a common cause in the system dynamics.
Given your stated goal, it appears that the OLS line ($p[y \vert x]$) is what you are probably after.
However, note that OLS assumes the residual variance is independent of $x$ (i.e. $\sigma^2_{y \vert x}$ is constant rather than a function of $x$), a condition known by the colorful term "homoskedasticity". This is something you should check (e.g. by plotting residuals). As noted above, your plot is difficult to judge by eye, but the ($y$) spread around the OLS line appears to vary in the $x$ direction. (So, as noted above, the TLS line may be a more reliable fit.)
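For a concrete comparison, here is a minimal sketch (with made-up synthetic data, not the poster's temperatures) of the two fits: OLS via an ordinary polynomial fit, and TLS via the first principal component of the centered point cloud:

```python
import numpy as np

# Synthetic data with noise in both x and y (values made up for illustration).
rng = np.random.default_rng(1)
t = rng.normal(size=200)
x = t + 0.3 * rng.normal(size=200)
y = 0.8 * t + 0.3 * rng.normal(size=200)

# OLS slope: minimizes vertical (y-direction) residuals.
slope_ols = np.polyfit(x, y, 1)[0]

# TLS / orthogonal slope: direction of the first principal component
# of the centered (x, y) cloud, i.e. it minimizes perpendicular residuals.
Z = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
slope_tls = Vt[0, 1] / Vt[0, 0]

print(f"OLS slope: {slope_ols:.3f}, TLS slope: {slope_tls:.3f}")
```

With noise in $x$, the OLS slope is attenuated toward zero (regression dilution), while the TLS slope tracks the underlying direction; which one is "right" depends on your goal, as described above.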
Best Answer
A linear system $Ax=b$ can fail to have a solution when its equations are inconsistent, i.e. when $b$ does not lie in the column space of $A$. But linear regression does not aim to satisfy $Ax=b$ exactly; it merely minimizes $\|Ax-b\|_2$. Setting the partial derivatives to zero yields the normal equations $A^\top A\,x = A^\top b$, and this system is always consistent, because $A^\top b$ lies in the column space of $A^\top$, which coincides with the column space of $A^\top A$. So the minimization always has a solution, either unique or infinitely many. Intuitively, one can always draw a best-fit line through a given set of points.
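To connect this back to the question's request for a small numerical example, here is a minimal check (points made up for illustration): three data points for which the exact system $Ax=b$ is inconsistent, yet the normal equations, and hence the least-squares fit, still have a solution:

```python
import numpy as np

# Three points with the same x-value appearing twice: (0, 0), (0, 2), (1, 1).
# The exact system Aw = b is inconsistent (rows 1 and 2 demand w0 = 0 and
# w0 = 2 simultaneously), yet the least-squares problem still has a solution.
A = np.array([[1.0, 0.0],   # columns: intercept, slope
              [1.0, 0.0],
              [1.0, 1.0]])
b = np.array([0.0, 2.0, 1.0])

# Normal equations A'A w = A'b are always consistent, since A'b lies
# in the column space of A'A.
w = np.linalg.solve(A.T @ A, A.T @ b)
print(w)  # [0.5 0.5] -> best-fit line y = 0.5 + 0.5*x

# Same answer from the library least-squares routine.
w2, *_ = np.linalg.lstsq(A, b, rcond=None)
print(w2)
```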