I understand the intuition that polynomial regression with a higher degree is able to fit the data better, because it decreases the bias while increasing the variance. However, I am unable to prove this mathematically. Can anyone explain why it is true, i.e. why fitting a higher-degree polynomial decreases $\| Y - X \beta \|^2$? Thank you!
Solved – Why does the Mean Square Error decrease when I fit a higher-degree polynomial
error, polynomial, regression
Related Solutions
The data turned out to be generated by some cosine-like function sampled at low density. This caused the nearest-neighbour classifier to perform well for small $k$ (it matches the cosine, because it takes into account the points that lie far into the "other-class zone") and for large $k$ (it matches an almost linear function because it ignores ALL points that are on the "other side"). Everything in between confuses the classifier, because depending on which points are training data and which are test data, the classification border changes a lot. This causes the variance and the bias to be higher for moderate $k$ than for extremely low/high $k$...
Let's ignore the penalty term for a moment, while we explore the sensitivity of the solution to changes in a single observation. This has ramifications for all linear least-squares models, not just Ridge regression.
Notation
To simplify the notation, let $X$ be the model matrix, including a column of constant values (and therefore having $p+1$ columns indexed from $0$ through $p$), let $y$ be the response $n$-vector, and let $\beta=(\beta_0, \beta_1, \ldots, \beta_p)$ be the $p+1$-vector of coefficients. Write $\mathbf{x}_i = (x_{i0}, x_{i1}, \ldots, x_{ip})$ for observation $i$. The unpenalized objective is the (squared) $L_2$ norm of the difference,
$$RSS(\beta)=||y - X\beta||^2 = \sum_{i=1}^n (y_i - \mathbf{x}_i\beta)^2.\tag{1}$$
Without any loss of generality, order the observations so the one in question is the last. Let $k$ be the index of any one of the variables ($0 \le k \le p$).
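Formula $(1)$ is easy to verify numerically. Here is a minimal sketch with arbitrary made-up data (using numpy) checking that the matrix form and the summation form agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # constant column plus p predictors
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

rss_matrix = np.sum((y - X @ beta) ** 2)                    # ||y - X beta||^2
rss_sum = sum((y[i] - X[i] @ beta) ** 2 for i in range(n))  # sum_i (y_i - x_i beta)^2
assert np.isclose(rss_matrix, rss_sum)
```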
Analysis
The aim is to expose the essential simplicity of this situation by focusing on how the sum of squares $RSS$ depends on $x_{nk}$ and $\beta_k$--nothing else matters. To this end, split $RSS$ into the contributions from the first $n-1$ observations and the last one:
$$RSS(\beta) = (y_n - \mathbf{x}_n\beta)^2 + \sum_{i=1}^{n-1} (y_i - \mathbf{x}_i\beta)^2.$$
Both terms are quadratic functions of $\beta_k$. Considering all the other $\beta_j,$ $j\ne k$, as constants for the moment, this means the objective can be written in the form
$$RSS(\beta_k) = (x_{nk}^2 \beta_k^2 + E\beta_kx_{nk} + F) + (A^2\beta_k^2 + B\beta_k + C).$$
The new quantities $A\cdots F$ do not depend on $\beta_k$ or $x_{nk}$. Combining the terms and completing the square gives something in the form
$$RSS(\beta_k) = \left(\beta_k\sqrt{x_{nk}^2 + A^2} + \frac{Ex_{nk}+B}{2\sqrt{x_{nk}^2+A^2}} \right)^2 + G - \frac{(Ex_{nk}+B)^2}{4(x_{nk}^2+A^2)}\tag{2}$$
where the quantity $G$ does not depend on $x_{nk}$ or $\beta_k$.
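The algebra leading to $(2)$ can be checked symbolically. A minimal sketch with sympy (the symbol names mirror the quantities above, with $G = C + F$):

```python
import sympy as sp

beta_k, x_nk, A, B, C, E, F = sp.symbols('beta_k x_nk A B C E F', real=True)

# Split form of RSS: the last observation's term plus the quadratic from the other n-1 observations.
split = (x_nk**2 * beta_k**2 + E * beta_k * x_nk + F) + (A**2 * beta_k**2 + B * beta_k + C)

# Completed-square form (2), with G = C + F.
s = sp.sqrt(x_nk**2 + A**2)
completed = (beta_k * s + (E * x_nk + B) / (2 * s))**2 \
            + (C + F) - (E * x_nk + B)**2 / (4 * (x_nk**2 + A**2))

assert sp.simplify(split - completed) == 0  # the two forms agree identically
```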
Estimating sensitivity
We may readily estimate the sizes of the coefficients in $(2)$ when $|x_{nk}|$ grows large compared to $|A|$. When that is the case,
$$RSS(\beta_k) \approx \left(\beta_k x_{nk} + E/2\right)^2 + G-E^2/4.$$
This makes it easy to see what changing $|x_{nk}|$ must do to the optimum $\hat\beta_k$: the minimum is attained at $\hat\beta_k \approx -E/(2x_{nk})$, so for sufficiently large $|x_{nk}|$, $\hat\beta_k$ will be approximately inversely proportional to $x_{nk}$.
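This asymptotic behaviour is easy to see numerically. The sketch below uses an arbitrary made-up one-predictor-plus-intercept dataset and inflates the last observation's predictor value; the product $\hat\beta_1 \cdot x_{n1}$ settles toward a constant, i.e. the fitted coefficient becomes inversely proportional to $x_{n1}$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=9)           # predictor values for the first nine observations
y = rng.normal(size=10)          # ten responses; the tenth predictor value is varied below

for scale in [10, 100, 1000, 10000]:
    x_full = np.append(x, scale)                     # give the last observation a huge x value
    X = np.column_stack([np.ones(10), x_full])       # intercept column plus the predictor
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    print(scale, beta_hat[1] * scale)                # the product stabilizes as the scale grows
```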
We actually have learned, and proven, much more than was requested, because Ridge regression can be formulated as model $(1)$. Specifically, to the original $n$ observations you adjoin $p+1$ fake observations of the form $\mathbf{x}_{n+i} = (0,0,\ldots, 0,1,0,\ldots,0)$ with responses $y_{n+i}=0$, and then multiply each fake row by the penalty parameter $\lambda$. The preceding analysis shows that for $\lambda$ sufficiently large (and "sufficiently" can be computed in terms of $|A|$, which is a function of the actual data only), every one of the $\hat\beta_k$ will be approximately inversely proportional to $\lambda$.
An analysis that requires some more sophisticated results from Linear Algebra appears at The proof of shrinking coefficients using ridge regression through "spectral decomposition". It does add one insight: the coefficients in the asymptotic relationships $\hat\beta_k \sim 1/\lambda$ will be the reciprocal nonzero singular values of $X$.
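The augmented-data formulation itself can be verified numerically. Below is a minimal sketch with made-up data: solving ordinary least squares on the augmented system reproduces the closed-form ridge solution (note that with the fake rows scaled by $\lambda$, the implied penalty term is $\lambda^2\|\beta\|^2$), and the coefficients shrink toward zero as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # model matrix with constant column
y = rng.normal(size=n)

def ridge_via_augmentation(X, y, lam):
    """Adjoin p+1 fake rows lam * e_k with zero responses, then solve ordinary least squares."""
    X_aug = np.vstack([X, lam * np.eye(X.shape[1])])
    y_aug = np.concatenate([y, np.zeros(X.shape[1])])
    return np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

lam = 5.0
beta_aug = ridge_via_augmentation(X, y, lam)
beta_closed = np.linalg.solve(X.T @ X + lam**2 * np.eye(X.shape[1]), X.T @ y)
assert np.allclose(beta_aug, beta_closed)   # augmented OLS == ridge with penalty lam^2 * ||beta||^2

for lam in [1.0, 10.0, 100.0]:
    print(lam, ridge_via_augmentation(X, y, lam))  # every coefficient shrinks toward zero as lam grows
```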
Best Answer
Suppose your matrix $X$ consists of a single column vector $x_1$; that is, a single feature.
In linear regression, $\beta=(X^TX)^{-1}X^Ty=(x_1^Tx_1)^{-1}x_1^Ty$.
If you add a new variable $x_2$ of degree 2, obtaining $X_2$, then $$ y-X_2(X_2^TX_2)^{-1}X_2^Ty = y-x_1(x_1^Tx_1)^{-1}x_1^Ty -\tilde x_2(\tilde x_2^T\tilde x_2)^{-1}\tilde x_2^T\tilde y_1 $$ where $\tilde y_1 = y-x_1(x_1^Tx_1)^{-1}x_1^Ty$ is the residual from the first fit and $\tilde x_2 = x_2 - x_1(x_1^Tx_1)^{-1}x_1^Tx_2$ is the component of $x_2$ orthogonal to $x_1$ (splitting the projection onto $X_2$ into projections onto $x_1$ and $\tilde x_2$ is valid because these two vectors are orthogonal).
For any matrix $X$, $\pi_X(y)=X(X^TX)^{-1}X^Ty$ is the orthogonal projection of $y$ onto the space spanned by the columns of $X$; it is the point of that space closest to $y$, so $\|y-\pi_X(y)\|_2 \leq \|y-Xb\|_2$ for every coefficient vector $b$.
Since every column of $X$ is also a column of $X_2$, the column space of $X$ is contained in that of $X_2$, and minimizing over the larger space can only reduce the residual: $$ \|y-X_2(X_2^TX_2)^{-1}X_2^Ty\|_2^2 \leq \|y-X(X^TX)^{-1}X^Ty \|_2^2 $$ Equality holds only when $x_1(x_1^Tx_1)^{-1}x_1^Tx_2=x_2$, that is, when adding the column $x_2$ does not increase the rank of $X$.
More generally, if $X_n$ denotes the matrix obtained by adding polynomial terms up to degree $n$, then $$ y-X_n(X_n^TX_n)^{-1}X_n^Ty = y-X_{n-1}(X_{n-1}^TX_{n-1})^{-1}X_{n-1}^Ty -\tilde x_n(\tilde x_n^T\tilde x_n)^{-1}\tilde x_n^T\tilde y_{n-1}, $$ where $\tilde x_n$ is the component of the new column $x_n$ orthogonal to the columns of $X_{n-1}$, and the same reasoning applies: each added degree can only leave the residual sum of squares unchanged or make it smaller.
If the inverse does not exist, one can replace $(X^TX)^{-1}X^T$ with the Moore-Penrose pseudoinverse, $X^+$.
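To close the loop on the original question, here is a minimal numerical sketch with made-up data (using the pseudoinverse, as suggested above): the residual sum of squares on the training data never increases as the polynomial degree grows.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=20)
y = np.cos(3 * x) + rng.normal(scale=0.1, size=20)    # any target works; the monotonicity is general

for degree in range(1, 8):
    X = np.vander(x, N=degree + 1, increasing=True)   # columns 1, x, x^2, ..., x^degree
    beta_hat = np.linalg.pinv(X) @ y                   # least-squares fit via the pseudoinverse
    rss = np.sum((y - X @ beta_hat) ** 2)
    print(degree, rss)                                 # non-increasing in the degree
```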