Solved – Maximum Degree of Polynomial Regression

machine learningpolynomialregression

If we have 100 data points and want to perform polynomial regression, the maximum degree of our polynomial is n-1, where n is the number of data points. In this case, the maximum degree would be 99. I am aware that this will almost always overfit the data, but my question is this: why can we only go up to 99? What is to stop us from taking a polynomial of degree, say, 14 million, and fitting the data?

I am aware of the n-1 bound, but haven't found a good explanation online as to why this bound exists.

Best Answer

In short, what you are doing to fit a polynomial regression model like \begin{align*} Y&=\beta_0+\beta_1X+\beta_2X^2+\cdots+\beta_pX^p+\varepsilon\\ &=\mathbf{X}\boldsymbol\beta+\varepsilon \end{align*} is to perform a linear regression model with $p$ variables given by $(X,X^2,\ldots,X^p)$, so your design matrix, from a sample $\{(X_i,Y_i)\}_{i=1}^n$ is $$ \mathbf{X}=\pmatrix{\begin{array}{ccc}1 & X_1 & \cdots & X_1^p\\ \vdots & \vdots &\ddots &\vdots\\ 1 & X_n & \cdots & X_n^p \end{array}}_{n\times(p+1)}. $$ The estimate for $\boldsymbol\beta$ is $$ \hat{\boldsymbol\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}\mathbf{y}, $$ with $\mathbf{y}=(Y_1,\ldots,Y_n)^T$.

The important point is that $\mathbf{X}^T\mathbf{X}$ is a $(p+1)\times(p+1)$ matrix and that $\mathrm{rank}(\mathbf{X}^T\mathbf{X})=\mathrm{rank}(\mathbf{X})$. Then, if $p=n-1$, $\mathbf{X}^T\mathbf{X}$ has dimension $n\times n$ and its rank is $n$, so no problem, is invertible. But if $p=n$, the dimension of $\mathbf{X}^T\mathbf{X}$ is $(n+1)\times(n+1)$ and the rank remains $n$, so in that case (and also if $p>n$) is not invertible. In other words, linear dependence arises as the result of fitting more variables than sample points you have.

Related Solutions

Polynomial Regression – Adding Polynomial Slope Constraint in R Linear Regression

First of all, let's write the unconstrained model so that the coefficients are consistently ordered (in this case, from lowest to highest degree):

unconstrained_model <- lm(y ~ x + I(x^2) +I(x^3))

Secondly, constrained least square regression always has higher RMSE on the data sample than unconstrained least square, unless the latter satisfies the constraints, in which case the two solutions are the same. In practice, if the data don't respect the constraints very well, the constrained fit can be quite bad.

In your case the data are decreasing, which is good. But you also have 4 data points where the ordinate is nearly constant (with respect to the range of y, $I_y =[\min(y),\max(y)]$), even if the corresponding variation in x is considerable, with respect to $I_x = [\min(x),\max(x)]$.In other words, you have a large plateau in the center of your sample:

df <- data.frame(x,y)
library(ggplot2)
(p <- ggplot(data=df,aes(x=x,y=y)) + geom_line()+geom_point()+geom_smooth())

This is not a pattern which can be easily fit by a monotone decreasing third degree polynomial.

Having said that, let's solve the constrained regression problem. First of all let's get explicit equations for the constraints in terms of of the model coefficients. You already noted that the third degree polynomial is decreasing in $I=[0,1]$ if and only if

$$3ax^2+2bx+c <0 \quad \forall x \in I $$

Note that $d$ doesn't appear in this inequality, and the reason is obvious -whether our model (the third degree polynomial) is increasing or not, doesn't depend on the value of the intercept. Since $3ax^2+2bx+c$ must be negative in 0, we get one of the constraints as $c<0$. Now, to get the other constraint inequalities, we just need to make the substitutions

$$ t_1=x,\quad t_2=x^2$$

and note that

$$x\in[0,1]\Rightarrow(t_1,t_2)\in [0,1]\times [0,1]$$

We are then led to the simpler problem of imposing a negativity constraint on a linear (degree one) polynomial in two variables:

$$f(t_1,t_2)=3at_2+2bt_1+c <0 \quad \forall (t_1,t_2) \in [0,1]\times[0,1] $$

$f$ is linear and its domain, the unit square $[0,1]^2$, is convex. Then if $f$ is positive in the four corners $(0,0)$, $(0,1)$, $(1,0)$, $(1,1)$, it is positive everywhere in $[0,1]^2$. Actually, the only possible values for $(t_1,t_2)$ are those in the lower-right half of the square (the triangle with corners $(0,0)$, $(0,1)$, $(1,1)$), because

$$x \in [0,1] \Rightarrow x^2 \le x \Leftrightarrow t_2 \le t_1 $$

Thus, we only need to impose that $f$ is negative in the three corners $(0,0)$, $(0,1)$, $(1,1)$. The negativity condition in $(0,0)$ gives again $c<0$. Imposing it also in $(1,0)$ and $(1,1)$, we finally obtain the three conditions

$$c<0,\quad3a+c <0, \quad 3a+2b+c <0$$

Perfect! Now we have three linear inequality constraints which express the condition that our model is negative in $I$. To solve a linear least square problem with equality and/or inequality constraints, R offers a package with a simple and intuitive interface, limSolve.

library(limSolve)
A <- cbind(rep(1,length(x)),x,x^2,x^3)
b <- y
G <- matrix(nrow=3,ncol=4,byrow = TRUE,data = c(0, -1,-2,-3,0,-1,-2,0,0,-1,0,0))
h <- rep(0,3)
constrained_model <- lsei(A = A, B = b, G = G, H = h, type=2)

Great, now let's get predictions for both models. lm objects have a predict method, but no equivalent method is available for the results of lsei. Thus we'll use a different approach and define a my_predict function

my_predict <- function(x,coefficients){
X <- cbind(rep(1,length(x)),x,x^2,x^3)
predictions <- X%*%coefficients
}

Then

# compute predictions 
xpred <- seq(0,1,len=100)
predictions_constrained <- my_predict(xpred,constrained_model$X)
predictions_unconstrained <-    my_predict(xpred,unconstrained_model$coefficients)
df2 <- data.frame(xpred,predictions_unconstrained,predictions_constrained)

# plot results
p <- ggplot(data = df,aes(x = x, y = y,color = "data")) + 
geom_point() +
geom_line(data = df2, aes(x = xpred, y = predictions_unconstrained, color = "unconstrained fit")) +
geom_line(data = df2, aes(x = xpred, y = predictions_constrained, color = "constrained fit"))
p

As expected, the constrained fit is considerably worse than the unconstrained one. We can also compute the $R^2$, and get the same picture:

SS <-sum((y-mean(y))^2)
RSS_unconstrained <- sum((unconstrained_model$residuals)^2)
RSS_constrained <- sum((A%*%constrained_model$X-y)^2)
R2_constrained <- 1-RSS_constrained/SS
R2_unconstrained <- 1-RSS_unconstrained/SS

The constrained model explains only about 70% of the total variation, while the unconstrained model explains about 83% of the total variation. As a final note, the fact that the fit is worse on the sample data doesn't necessarily mean that the constrained model is worse than the unconstrained one, in terms of predictive accuracy. If, for example, the ideal model obeys the constraints, it may be that the constrained model has lower test error on unseen data.

Concerning the new set of data:

x <- c(0.01041667, 0.30208333, 0.61458333, 0.65625000, 0.83333333)
y <- c(772, 607, 576, 567, 550)

Repeating exactly the same calculations leads to a constrained model having the highest degree term equal to zero:

> constrained_model
$X
                  x                     
 759.8756 -449.1166  224.5583    0.0000 

$residualNorm
[1] 0

$solutionNorm
[1] 1854.049

$IsError
[1] FALSE

$type
[1] "lsei"

This polynomial is nonincreasing on $I=[0,1]$, as required, but it has a minimum in $x=1$, thus it would start increasing for $x>1$. As a matter of fact, the derivative is $2bx+c$ (since $a=0$), which for $x=1$ becomes

> 2*constrained_model$X[3]+constrained_model$X[2]

-6.82121e-13

which, given the accuracy of the computations involved, is basically 0. The fact that $x=1$ is a minimum is also evident from the plot:

This

Solved – R polynomial expansion error on “unique points” and degree

In R, the poly function generates a polynomial and then does a QR decomposition to make it orthogonal. For high degree polynomial input, you'll get a high condition number for the Vandermode matrix, and when poly tries to make the columns orthogonal, it's going to throw an error (because the input looks rank deficient within the limits of double precision floating point).

Consider:

x <- poly(runif(100),30,raw=TRUE)
QR <- qr(x)
QR$rank

This is basically what's in the source code of poly. QR$rank is going to be far lower than the degree.

Best Answer

Related Solutions

Polynomial Regression – Adding Polynomial Slope Constraint in R Linear Regression

Solved – R polynomial expansion error on “unique points” and degree

Related Question