Least Squares Regression

least squares, linear algebra, machine learning

Suppose you are given a set of $n$ points $\{(x_1, y_1), \dots, (x_n, y_n)\}$ with $x_i \neq x_j$ for all $i \neq j$.
It is a known fact that there always exists a polynomial $a(x) = w_0 + w_1x + \dots + w_{n-1}x^{n-1}$ of degree less than $n$ (some coefficients may be $0$) that passes exactly through all $n$ of these points. To find this polynomial we can use the Least Squares solution, or, even easier, just solve an $n \times n$ linear system with a Vandermonde matrix, which is always invertible when the $x_i$ are distinct.
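To make this concrete, here is a minimal sketch in NumPy; the data values are hypothetical, and `np.vander` is called with `increasing=True` so that the columns are ordered $1, x, x^2, \dots$, matching the coefficient order $w_0, w_1, \dots, w_{n-1}$:

```python
import numpy as np

# Hypothetical sample points with distinct x-values
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 5.0])
n = len(x)

# n-by-n Vandermonde matrix: row i is [1, x_i, x_i^2, ..., x_i^{n-1}]
V = np.vander(x, N=n, increasing=True)

# Distinct x_i make V invertible, so V w = y has a unique exact solution
w = np.linalg.solve(V, y)

# The degree-(n-1) polynomial passes exactly through every point
assert np.allclose(V @ w, y)
print(w)  # coefficients w_0, w_1, ..., w_{n-1}
```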

Question: If we try to fit a polynomial of degree more than $n-1$ (e.g. degree $n$) using Least Squares, we will get some garbage result. But an intuitive solution would be to keep the same coefficients as the optimal polynomial of degree $n-1$ discussed above (for which the MSE is already $0$) and set $w_n = 0$.
Can somebody explain to me why least squares fails to find this solution and actually breaks in this situation?

Best Answer

It's not always the case that least squares will give you a garbage result. Let's talk about what's happening.

If you're fitting a polynomial of degree more than $n-1$, then you have more coefficients (unknowns) than you have data points (equations). If you create the Vandermonde matrix, it's a short, wide matrix, so the system is underdetermined. There are infinitely many solutions that satisfy your constraints (fitting the polynomial through the data). One of those solutions is the one obtained by fitting the polynomial of degree $n-1$ to the data and setting the remaining coefficients to $0$. It is possible to get that solution.
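As a sketch of that claim, using the same hypothetical data as above, one can pad the exact degree-$(n-1)$ coefficients with $w_n = 0$ and check that they still solve the wide, underdetermined system:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 5.0])
n = len(x)

# Degree n-1 fit: square, invertible Vandermonde system
V_square = np.vander(x, N=n, increasing=True)
w_exact = np.linalg.solve(V_square, y)

# Degree n fit: n rows, n+1 columns -- short and wide (underdetermined)
V_wide = np.vander(x, N=n + 1, increasing=True)

# Padding the exact degree-(n-1) coefficients with w_n = 0 is still
# one valid solution of the underdetermined system
w_padded = np.append(w_exact, 0.0)
assert np.allclose(V_wide @ w_padded, y)
```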

How does least squares determine which solution is chosen? That depends on the implementation of least squares. A common choice, when solving $Ax=b$, is to take $x^\star = A^\dagger b$, where $A^\dagger$ is the pseudoinverse of $A$. In this case, $x^\star$ is defined to be the solution of the following optimization problem: \begin{align} \text{minimize} &\hspace{1em} \|x\|_2 \\ \text{subject to} &\hspace{1em} Ax = \text{proj}_{\operatorname{Im}(A)}(b). \end{align} That is, among all $x$ for which $Ax$ equals the projection of $b$ onto the image of $A$ (the column space of $A$), choose the one of smallest norm.
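Here is a sketch of that behavior in NumPy, again with the hypothetical data from above: `np.linalg.pinv` gives the pseudoinverse solution, and `np.linalg.lstsq` returns the same minimum-norm solution for a wide system like this one.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 5.0])
n = len(x)

V_wide = np.vander(x, N=n + 1, increasing=True)

# Minimum-norm exact solution via the pseudoinverse
w_min_norm = np.linalg.pinv(V_wide) @ y

# lstsq also returns the minimum-norm solution for this wide system
w_lstsq, *_ = np.linalg.lstsq(V_wide, y, rcond=None)
assert np.allclose(w_min_norm, w_lstsq)

# The fit still interpolates the data exactly (no garbage, zero residual) ...
assert np.allclose(V_wide @ w_min_norm, y)
# ... but the highest-order coefficient is generally not zero
print(w_min_norm[-1])
```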

It can be the case that shifting some weight onto the higher-order coefficients, while keeping them small, still lets the polynomial fit the data and yields a smaller overall coefficient norm. When that happens, and you use that polynomial to interpolate at other points, you will see much wider variation (higher frequencies) between the data points, because the higher-order terms now contribute.
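To see how this plays out for a particular data set, one can compare the coefficient norms and the range of values of the two exact interpolants on a fine grid. This is only an illustration with the same hypothetical points; which curve varies more depends on the data.

```python
import numpy as np

# Hypothetical data as in the sketches above
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 5.0])
n = len(x)

V_wide = np.vander(x, N=n + 1, increasing=True)
# Degree-(n-1) coefficients padded with w_n = 0, vs. minimum-norm degree-n fit
w_padded = np.append(np.linalg.solve(np.vander(x, N=n, increasing=True), y), 0.0)
w_min_norm = np.linalg.pinv(V_wide) @ y

# Evaluate both exact interpolants on a fine grid between the data points
xs = np.linspace(x.min(), x.max(), 200)
P = np.vander(xs, N=n + 1, increasing=True)

print("coefficient norms:", np.linalg.norm(w_padded), np.linalg.norm(w_min_norm))
print("interpolant ranges:", np.ptp(P @ w_padded), np.ptp(P @ w_min_norm))
```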
