Solved – Why does Mean Square Error decrease when I fit a higher-degree polynomial

error, polynomial, regression

I understand the intuition that polynomial regression of higher degree is able to fit the data better, as it decreases the bias while increasing the variance. However, I am unable to prove this mathematically. Can anyone explain to me why this is true, i.e. why adding higher-degree terms decreases $\| Y - X \beta \|^2$? Thank you!

Best Answer

Suppose your matrix $X$ consists of a single column vector $x_1$; that is, a single feature.

In linear regression, $\beta=(X^TX)^{-1}X^Ty=(x_1^Tx_1)^{-1}x_1^Ty$.
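
As a quick numerical sanity check (a minimal NumPy sketch on made-up data, not part of the argument), this closed form agrees with a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(50, 1))             # design matrix with a single column
y = 2.0 * x1[:, 0] + rng.normal(size=50)  # toy response

# Closed form: beta = (x1^T x1)^{-1} x1^T y
beta = np.linalg.solve(x1.T @ x1, x1.T @ y)

# Agrees with NumPy's generic least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(x1, y, rcond=None)
assert np.allclose(beta, beta_lstsq)
```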

If you add a new column $x_2$ containing the feature raised to degree 2, obtaining $X_2 = [x_1 \; x_2]$, then $$ y-X_2(X_2^TX_2)^{-1}X_2^Ty = \tilde y_1 - \tilde x_2(\tilde x_2^T\tilde x_2)^{-1}\tilde x_2^T\tilde y_1 $$ where $\tilde y_1 = y-x_1(x_1^Tx_1)^{-1}x_1^Ty$ is the residual of the first fit and $\tilde x_2 = x_2-x_1(x_1^Tx_1)^{-1}x_1^Tx_2$ is the part of $x_2$ orthogonal to $x_1$ (this is the Frisch–Waugh–Lovell decomposition).
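
Here is a short sketch verifying this decomposition numerically (synthetic data; the helper `proj` is my own name, assumed only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=(50, 1))
x2 = x1 ** 2                                 # new degree-2 column
y = 2.0 * x1[:, 0] - 0.5 * x2[:, 0] + rng.normal(size=50)

def proj(X, v):
    """Orthogonal projection of v onto the column space of X."""
    return X @ np.linalg.solve(X.T @ X, X.T @ v)

X2 = np.hstack([x1, x2])
joint_resid = y - proj(X2, y)                # residual of the joint fit

y1_tilde = y - proj(x1, y)                   # residual of y regressed on x1
x2_tilde = x2 - proj(x1, x2)                 # part of x2 orthogonal to x1
seq_resid = y1_tilde - proj(x2_tilde, y1_tilde)

assert np.allclose(joint_resid, seq_resid)   # the two sides of the identity match
```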

For any matrix $X$ with full column rank, $\pi_X(y)=X(X^TX)^{-1}X^Ty$ is the orthogonal projection of $y$ onto the space spanned by the columns of $X$. The residual $y-\pi_X(y)$ is orthogonal to that space, so by the Pythagorean theorem $\|y-\pi_X(y)\|_2 \leq \|y-v\|_2$ for every $v$ in the column space; in particular, projecting onto a larger space that contains the columns of $X$ can only shrink the residual norm.
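
A tiny sketch of these two projection facts, again on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

pi_y = X @ np.linalg.solve(X.T @ X, X.T @ y)  # orthogonal projection of y

# The residual is orthogonal to every column of X ...
assert np.allclose(X.T @ (y - pi_y), 0)
# ... and beats any other candidate X @ b in the column space
b = rng.normal(size=3)
assert np.linalg.norm(y - pi_y) <= np.linalg.norm(y - X @ b)
```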

This means that $$ \|y-X_2(X_2^TX_2)^{-1}X_2^Ty\|_2^2 \leq \|y-X(X^TX)^{-1}X^Ty \|_2^2. $$ Equality holds exactly when $\tilde x_2^T\tilde y_1 = 0$, i.e. when the new column is orthogonal to the current residual; in particular it holds when $x_1(x_1^Tx_1)^{-1}x_1^Tx_2=x_2$, that is, when the addition of column $x_2$ does not increase the rank of $X$.

More generally, if $X_n$ denotes the matrix obtained by adding polynomial columns up to degree $n$, then $$ y-X_n(X_n^TX_n)^{-1}X_n^Ty = \tilde y_{n-1} - \tilde x_n(\tilde x_n^T\tilde x_n)^{-1}\tilde x_n^T\tilde y_{n-1} $$ where $\tilde y_{n-1} = y-X_{n-1}(X_{n-1}^TX_{n-1})^{-1}X_{n-1}^Ty$ and $\tilde x_n = x_n-X_{n-1}(X_{n-1}^TX_{n-1})^{-1}X_{n-1}^Tx_n$, and the same reasoning applies: each added degree can only reduce, or leave unchanged, the squared residual norm.
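
The following sketch makes the monotone decrease concrete by fitting polynomial columns of increasing degree via a Vandermonde matrix (synthetic data, no intercept, matching the setup above):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + 0.1 * rng.normal(size=100)

prev_rss = np.inf
for n in range(1, 8):
    # Columns x, x^2, ..., x^n (drop the constant column of the Vandermonde matrix)
    Xn = np.vander(x, N=n + 1, increasing=True)[:, 1:]
    beta, *_ = np.linalg.lstsq(Xn, y, rcond=None)
    rss = np.sum((y - Xn @ beta) ** 2)
    assert rss <= prev_rss + 1e-9            # residual sum of squares never increases
    prev_rss = rss
    print(f"degree {n}: RSS = {rss:.4f}")
```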

If the inverse does not exist, one can replace $(X^TX)^{-1}X^T$ with the Moore–Penrose pseudoinverse, $X^+$.
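
A minimal sketch of this fallback in NumPy, using a deliberately rank-deficient design (the duplicated column is contrived):

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=(50, 1))
X = np.hstack([x1, 2.0 * x1])               # second column is collinear with the first
y = rng.normal(size=50)

# X^T X is singular here, so use the Moore-Penrose pseudoinverse X^+ instead
beta = np.linalg.pinv(X) @ y
fitted = X @ beta                           # still the orthogonal projection of y

assert np.allclose(X.T @ (y - fitted), 0)   # residual orthogonal to col(X)
```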
