Curve fitting with polynomials and inner products

least squares, linear regression, numerical methods, numerical-calculus, regression

So, I was learning about curve-fitting methods, and the teacher was showing an example in which each data point was represented by $(x_1, x_2, x_3)$ and we had a measurement $y$ at that point. We wanted to approximate $y$ with a linear combination of polynomials.

We first defined:

$$
\begin{matrix}
p_1(x_1, x_2, x_3) = 1 & p_2(x_1, x_2, x_3) = x_1 & p_3(x_1, x_2, x_3) = x_2 \\
p_4(x_1, x_2, x_3) = x_3 & p_5(x_1, x_2, x_3) = x_1^2 & p_6(x_1, x_2, x_3) = x_2^2 \\
p_7(x_1, x_2, x_3) = x_3^2 & p_8(x_1, x_2, x_3) = x_1x_2 & p_9(x_1, x_2, x_3) = x_1x_3 \\
p_{10}(x_1, x_2, x_3) = x_2x_3
\end{matrix}
$$

We proceeded to define a matrix $A$ such that:

$$
A = \begin{bmatrix}
\left<p_1, p_1\right> & \left<p_1, p_2\right> & \dots & \left<p_1, p_{10}\right> \\
\left<p_2, p_1\right> & \left<p_2, p_2\right> & \dots & \left<p_2, p_{10}\right> \\
\vdots & \vdots & \ddots & \vdots \\
\left<p_{10}, p_1\right> & \left<p_{10}, p_2\right> & \dots & \left<p_{10}, p_{10}\right> \\
\end{bmatrix}
$$

And $b$ such that:

$$
b = \begin{bmatrix}
\left<y, p_1\right> \\
\left<y, p_2\right> \\
\vdots \\
\left<y, p_{10}\right> \\
\end{bmatrix}
$$

Finally, we proceeded to solve the equation $A\vec{c} = \vec{b}$, where $\vec{c}$ is the vector of coefficients (using $c$ because $x$ was already used for the features) such that:

$$
f(x_1, x_2, x_3) = \sum c_ip_i(x_1, x_2, x_3)
$$
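
To make sure I understand the pipeline, here is a rough NumPy sketch of what I think we did. This is my own reconstruction, not the teacher's code: the data is made up, and I'm assuming the inner product is the sum over the data points that I spell out in Edit 2 below.

```python
import numpy as np

# The ten basis polynomials in (x1, x2, x3), in the order given above.
basis = [
    lambda x1, x2, x3: np.ones_like(x1),
    lambda x1, x2, x3: x1,
    lambda x1, x2, x3: x2,
    lambda x1, x2, x3: x3,
    lambda x1, x2, x3: x1**2,
    lambda x1, x2, x3: x2**2,
    lambda x1, x2, x3: x3**2,
    lambda x1, x2, x3: x1 * x2,
    lambda x1, x2, x3: x1 * x3,
    lambda x1, x2, x3: x2 * x3,
]

# Synthetic data just for illustration (N points, 3 features, 1 measurement).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

# Evaluate every basis polynomial at every data point: P[k, i] = p_i(x^(k)).
P = np.column_stack([p(X[:, 0], X[:, 1], X[:, 2]) for p in basis])

# Assumed inner product <f, g> = sum over the data points of f*g (see Edit 2).
A = P.T @ P                # A[i, j] = <p_i, p_j>, always 10x10
b = P.T @ y                # b[i]    = <y, p_i>

c = np.linalg.solve(A, b)  # coefficients of f = sum_i c_i * p_i
```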

The thing is, I didn't really understand why we need to compute the inner products of those polynomials to build these matrices. My first idea would be to just evaluate the polynomials and fit using the Moore-Penrose pseudoinverse. What is the difference between those two approaches? I didn't really get the teacher's explanation, and I can't find one in any book.

Edit 1
So, I wasn't very clear about how I would approach the problem instead of my professor's solution, so I want to clarify with an example. Let's say I had the data

$$
\begin{array}{|cc|c|}
\hline
x_1 & x_2 & y \\
\hline
1 & 2 & 1\\
1 & 3 & 2\\
2 & 1 & 3\\
2 & 2 & 2\\
\hline
\end{array}
$$

And I wanted to approximate with $f(x_1, x_2) = c_1 + c_2x_1 + c_3x_2 + c_4x_1^2 + c_5x_2^2 + c_6x_1x_2$. I would create a matrix $A$ like this:

$$
A = \begin{bmatrix}
1 & x_1 & x_2 & x_1^2 & x_2^2 & x_1x_2
\end{bmatrix} = \begin{bmatrix}
1 & 1 & 2 & 1 & 4 & 2 \\
1 & 1 & 3 & 1 & 9 & 3 \\
1 & 2 & 1 & 4 & 1 & 2 \\
1 & 2 & 2 & 4 & 4 & 4
\end{bmatrix}
$$

And have $b$ like this:

$$
b = \begin{bmatrix}
1 \\
2 \\
3 \\
2
\end{bmatrix}
$$

and proceed to solve $A\vec{c} = \vec{b}$ to find the coefficients, without computing any inner products.
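
In code, my idea would look roughly like this (just a sketch of my approach; `np.linalg.pinv` is the Moore-Penrose pseudoinverse, and `np.linalg.lstsq` would give an equivalent answer):

```python
import numpy as np

# Design matrix: one row per data point, columns 1, x1, x2, x1^2, x2^2, x1*x2.
A = np.array([
    [1, 1, 2, 1, 4, 2],
    [1, 1, 3, 1, 9, 3],
    [1, 2, 1, 4, 1, 2],
    [1, 2, 2, 4, 4, 4],
], dtype=float)
b = np.array([1, 2, 3, 2], dtype=float)

# Moore-Penrose pseudoinverse; with 4 equations and 6 unknowns this picks
# the minimum-norm exact solution rather than a unique one.
c = np.linalg.pinv(A) @ b
# Equivalent: c, *_ = np.linalg.lstsq(A, b, rcond=None)
```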

Edit 2
So, another clarification, on how the inner product is defined in my teacher's example. Let's say we have the same data as in my example above; then:

$$
\left<p_2, p_3\right> = 1\cdot2 + 1\cdot3 + 2\cdot1 + 2\cdot2 = 11
$$
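
As a quick sanity check of that number:

```python
import numpy as np

x1 = np.array([1, 1, 2, 2])
x2 = np.array([2, 3, 1, 2])

# <p_2, p_3> with p_2 = x1 and p_3 = x2, summed over the four data points.
print(np.dot(x1, x2))  # 1*2 + 1*3 + 2*1 + 2*2 = 11
```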

Also, right now it's an exact fit because I have more features than data points, which I did to keep the example short, but you can imagine that in the real scenario I have much more data. Also, the best fit is the one that minimizes the squared error.

Best Answer

Per your edit: in your method, each data point gives a row of $A$. If you have many more data points, then you have many more rows. If $A$ has more rows than columns, then $A\vec c = \vec b$ is unlikely to have any solution, as it is overconstrained. You could fix this by increasing the maximum degree of your polynomial until you have at least as many columns as rows again, but then you've given up the ability to limit the degree of your answer, and as mentioned in my original answer, that leads to wildly oscillating polynomials.

Note that in your professor's version, the size of $A$ does not depend on the number of data points; only the values of the inner products change with the data. $A$ is always a square matrix. And since the polynomials themselves are chosen to be independent of each other, you are guaranteed that $A\vec c = \vec b$ has at least one solution, even when $A$ is not invertible.
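
A small numeric illustration of the last two paragraphs (my own sketch, with made-up data and using the sum-over-data-points inner product from your Edit 2):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200                                   # many more points than basis functions
X = rng.normal(size=(N, 3))
y = rng.normal(size=N)

# Evaluate the ten basis polynomials at every point.
x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
P = np.column_stack([np.ones(N), x1, x2, x3, x1**2, x2**2, x3**2,
                     x1*x2, x1*x3, x2*x3])

print(P.shape)                            # (200, 10): grows with the data
c, residual, rank, _ = np.linalg.lstsq(P, y, rcond=None)
print(residual)                           # nonzero here: no exact solution

A = P.T @ P                               # Gram matrix of inner products
b = P.T @ y
print(A.shape)                            # (10, 10): independent of N
```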

Also, you can prove that the professor's version comes up with a solution that minimizes squared error. Your method almost certainly will not.


Original Post:

How you produce a best-fit polynomial depends on your definition of "best-fit". Your example defines it as "exact match at these four points", and cares not the slightest how the polynomial behaves anywhere else. Such polynomials have a tendency to swing wildly between points. For functions of one variable that are almost linear, the interpolating polynomial will generally end up looking sinusoidal over the range of interest instead of the nice, nearly straight line you would want.
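
Here is a rough sketch of that effect (my own illustration, with made-up, nearly linear data): an exact interpolant through noisy samples tends to overshoot between the points, while a low-degree least-squares fit stays close to the underlying line.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 11)
y = x + 0.05 * rng.normal(size=x.size)    # nearly linear data with small noise

# Exact interpolation: degree-10 polynomial through all eleven points.
interp = np.polynomial.Polynomial.fit(x, y, deg=len(x) - 1)

# Least-squares straight line through the same points, for comparison.
line = np.polynomial.Polynomial.fit(x, y, deg=1)

xx = np.linspace(0.0, 1.0, 400)
print(np.max(np.abs(interp(xx) - xx)))    # often far above the noise level,
                                          # especially near the interval ends
print(np.max(np.abs(line(xx) - xx)))      # stays on the order of the noise
```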

How your professor's method defines "best-fit" is not clear, because you did not indicate which inner product of polynomials is being used. (And it has been ages since I've dealt with this, so I'd probably have to think hard about what it was doing before I could answer anyway.)