Solved – B-Splines VS high order polynomials in regression

multiple regression, polynomial, regression, regularization, splines

I do not have a specific example or task in mind. I'm new to B-splines and want to better understand how they are used in regression.

Let's assume that we want to assess the relationship between the response variable $y$ and some predictors $x_1, x_2,…,x_p$. The predictors include some numerical variables as well as some categorical ones.

Let's say that after fitting a regression model, one of the numerical variables, e.g. $x_1$, is significant. A logical next step is to assess whether higher-order polynomial terms, e.g. $x_1^2$ and $x_1^3$, are required in order to adequately explain the relationship without overfitting.

My questions are:

At what point do you choose between B-splines and simple higher-order polynomials? E.g. in R:
```
y ~ poly(x1,3) + x2 + x3
```
vs
```
y ~ bs(x1,3) + x2 + x3
```
How can you use plots to inform your choice between those two, and what happens if it's not really clear from the plots (e.g. due to massive amounts of data points)?
How would you assess the two-way interaction terms between $x_2$ and, let's say, $x_3$?
How do the above change for different types of models?
Would you consider never using high-order polynomials, always fitting B-splines instead and penalising the high flexibility?

Best Answer

I would usually only consider splines rather than polynomials. Polynomials cannot model thresholds and are often undesirably global, i.e., observations at one range of the predictor have a strong influence on what the model does at a different range (Magee, 1998, The American Statistician; Frank Harrell's Regression Modeling Strategies). And of course restricted splines, which are linear outside the extremal knots, are better for extrapolation, or even interpolation at extreme values of the predictors.
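To illustrate the extrapolation point, here is a minimal sketch on simulated data. The data-generating function, the sample size and `df = 4` are assumptions chosen for illustration; they do not come from the references above. It fits a global cubic polynomial and a natural (restricted) cubic spline via `splines::ns`, then checks whether predictions beyond the data range lie on a straight line.

```r
library(splines)  # ships with R; provides ns() and bs()

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)   # simulated data, illustration only

fit_poly <- lm(y ~ poly(x, 3))       # global cubic polynomial
fit_ns   <- lm(y ~ ns(x, df = 4))    # natural cubic spline, linear beyond the boundary knots

# Equally spaced points beyond the observed range of x
new_x <- data.frame(x = seq(11, 15, by = 1))
pred_poly <- predict(fit_poly, new_x)
pred_ns   <- predict(fit_ns, new_x)

# Second differences vanish (up to rounding) exactly when the points lie on a line
d2_ns   <- diff(pred_ns, differences = 2)
d2_poly <- diff(pred_poly, differences = 2)
```

The spline's second differences should be zero up to rounding, while the polynomial keeps curving outside the data.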

One case where you may want to consider polynomials is when it is important to explain your model to a nontechnical audience. People understand polynomials better than splines. (Edit: Matthew Drury points out that people may only think they understand polynomials better than splines. I won't take sides on this question.)

Plots are often not very useful in deciding between different ways of dealing with nonlinearity. Better to do cross-validation. This will also help you assess interactions, or find a good penalization.
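As a rough sketch of what such a cross-validation could look like, the code below compares a cubic polynomial with two spline bases by 10-fold cross-validated mean squared error. Everything here is an illustrative assumption: the simulated data, the candidate formulas, the `df` values, and the helper name `cv_mse` (hypothetical, not from any package).

```r
library(splines)

set.seed(42)
n  <- 300
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y  <- sin(x1) + 0.5 * x2 + rnorm(n, sd = 0.3)   # simulated, illustration only
dat <- data.frame(y, x1, x2)

# Candidate models; boundary knots are fixed so held-out points never fall outside them
formulas <- list(
  poly3 = y ~ poly(x1, 3) + x2,
  bs6   = y ~ bs(x1, df = 6, Boundary.knots = c(0, 10)) + x2,
  ns5   = y ~ ns(x1, df = 5, Boundary.knots = c(0, 10)) + x2
)

# Use the same fold assignment for every candidate
k <- 10
folds <- sample(rep(seq_len(k), length.out = n))

# Hypothetical helper: k-fold cross-validated mean squared prediction error
cv_mse <- function(formula, data, folds) {
  errs <- sapply(sort(unique(folds)), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])
    pred <- predict(fit, newdata = data[folds == i, ])
    mean((data$y[folds == i] - pred)^2)
  })
  mean(errs)
}

cv_results <- sapply(formulas, cv_mse, data = dat, folds = folds)
```

The same loop can be reused to compare formulas with and without an interaction term, or to scan a grid of `df` values as a crude form of tuning.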

Finally, my answer doesn't change with the kind of model, because the points above are valid for any statistical or ML model.

The setting

We are considering an $n\times k$ model matrix $\mathbb X$ of potential explanatory variables in some kind of regression. This means we're thinking of the columns of $\mathbb X$ as being $n$-vectors $X_1, X_2, \ldots, X_k$ and we will be forming linear combinations of them, $\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k,$ to predict or estimate a response.

Sometimes a regression can be improved by introducing additional columns created by multiplying various columns of $\mathbb X$ by each other, coefficient by coefficient. Such products are called "monomials" and can be written like

$$X_1^{d_1} X_2^{d_2} \cdots X_k^{d_k}$$

where each "power" $d_i$ is zero or greater, representing how many times $X_i$ appears in the product. Notice that $X_i^0$ is an $n$-vector of constant coefficients ($1$) and $X_i^1=X_i$ itself. Thus, monomials (as vectors) generate a vector space that includes the original column space of $\mathbb X.$ The possibility that it might be a larger vector space gives this procedure greater scope to model the response with linear combinations.

We intend to replace the original model matrix $\mathbb X$ by a collection of linear combinations of monomials. When the degree of at least one of these monomials exceeds $1,$ this is called polynomial regression.
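For concreteness, here is a small sketch of a monomial model matrix for two variables up to degree 2. The vectors `x1` and `x2` are arbitrary illustrative values. Calling `poly` with `raw = TRUE` returns plain monomials rather than orthogonal polynomials, and the hand-built matrix shows which products they are.

```r
x1 <- c(1, 5, 2, 7, 3)
x2 <- c(3, 6, 4, 1, 8)   # arbitrary illustrative values

# raw = TRUE asks for plain monomials instead of orthogonal polynomials
M <- poly(x1, x2, degree = 2, raw = TRUE)
colnames(M)              # each name lists the powers of x1 and x2 in that monomial

# The same five columns built by hand, in the same order
M_hand <- cbind(x1, x1^2, x2, x1 * x2, x2^2)
```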

Gradings of polynomials

The degree of a monomial is the sum of its powers, $d_1+d_2+\ldots+d_k.$ The degree of a linear combination of monomials (a "polynomial") is the largest degree among the monomial terms with nonzero coefficients. The degree has an intrinsic meaning, because when you change the basis of the original vector space, each vector $X_i$ is newly represented by a linear combination of all the vectors; monomials $X_1^{d_1} X_2^{d_2} \cdots X_k^{d_k}$ thereby become polynomials of the same degree; and consequently the degree of any polynomial is unchanged.

The degree provides a natural "grading" to this polynomial algebra: the vector space generated by all linear combinations of monomials in $X$ of degree up to and including $d+1,$ called the "polynomials of [or up to] degree $d+1$ in $X,$" extends the vector space of polynomials up to degree $d$ in $X.$

Uses of polynomial regression

Often, polynomial regression is exploratory in the sense that we don't know at the outset which monomials to include. The process of creating new model matrices out of monomials and re-fitting the regression may need to be repeated many times, perhaps an astronomical number of times in some machine learning settings.

The chief problems with this approach are

Monomials often introduce problematic amounts of "multicollinearity" in the new model matrix, primarily because powers of a single variable tend to be highly collinear. (Collinearity among powers of two different variables depends on how those variables are related and is therefore less predictable.)
Changing just a single column of the model matrix, or introducing a new one, or deleting one, may require a "cold restart" of the regression procedure, potentially taking a long time for computation.
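The collinearity among powers of a single variable is easy to see numerically. The sketch below uses an arbitrary grid on $[0,1]$ (an assumption for illustration) and compares raw powers with the orthogonal columns returned by `poly`.

```r
x <- seq(0, 1, length.out = 50)   # arbitrary illustrative grid

raw_powers <- cbind(x, x^2, x^3)
cor_raw <- cor(raw_powers)        # off-diagonal entries are close to 1

orth <- poly(x, 3)                # orthogonal polynomial columns of degrees 1 to 3
cor_orth <- cor(orth)             # off-diagonal entries are zero up to rounding

# Condition numbers of the two model matrices, each with an intercept column
kappa_raw  <- kappa(cbind(1, raw_powers), exact = TRUE)
kappa_orth <- kappa(cbind(1, orth), exact = TRUE)
```

Both matrices span the same column space, so the fitted values of a regression are identical; only the numerical conditioning and the meaning of the individual coefficients differ.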

The gradings of polynomial algebras provide a way to overcome both problems.

Orthogonal polynomials in one variable

Given a single column vector $X,$ a set of "orthogonal polynomials" for $X$ is a sequence of column vectors $p_0(X), p_1(X), p_2(X),\ldots$ formed as linear combinations of monomials in $X$ alone--that is, as powers of $X$--with the following properties:

For each degree $d=0, 1, 2, \ldots, $ the vectors $p_0(X), p_1(X), \ldots, p_d(X)$ generate the same vector space as $X^0, X^1, \ldots, X^d.$ (Notice that $X^0$ is the $n$-vector of ones and $X^1$ is just $X$ itself.)
The $p_i(X)$ are mutually orthogonal in the sense that for $i\ne j,$ $$p_i(X)^\prime p_j(X) = 0.$$

Usually, the replacement model matrix $$\mathbb{P} = \pmatrix{p_0(X) & p_1(X) & \cdots & p_d(X)}$$ formed from these polynomials is chosen to be orthonormal by normalizing its columns to unit length: $$\mathbb{P}^\prime \mathbb{P} = \mathbb{I}_{d+1}.$$ Because the inverse of $\mathbb{P}^\prime \mathbb{P}$ appears in most regression equations and the inverse of the identity matrix $\mathbb{I}_{d+1}$ is itself, this represents a huge computational gain.

Orthonormality very nearly determines the $p_i(X).$ You can see this by construction:

The first polynomial, $p_0(X),$ must be a multiple of the $n$-vector $\mathbf{1}=(1,1,\ldots,1)^\prime$ of unit length. There are only two choices, $\pm \sqrt{1/n}\mathbf{1}.$ It is customary to pick the positive square root.
The second polynomial, $p_1(X),$ must be orthogonal to $\mathbf{1}.$ It can be obtained by regressing $X$ against $\mathbf{1},$ whose solution is the vector of mean values $\hat X = \bar{X}\mathbf{1}.$ If the residuals $\epsilon = X - \hat X$ are not identically zero, they give the only two possible solutions $p_1(X) = \pm \left(1/||\epsilon||\right)\,\epsilon.$

...

Generally, $p_{d+1}(X)$ is obtained by regressing $X^{d+1}$ against $p_0(X), p_1(X), \ldots, p_d(X)$ and rescaling the residuals to be a vector of unit length. There are two choices of sign when the residuals are not all zero. Otherwise, the process ends: it will be fruitless to look at any higher powers of $X.$ (This is a nice theorem but its proof need not distract us here.)

This is the Gram-Schmidt process applied to the intrinsic sequence of vectors $X^0, X^1, \ldots, X^d, \ldots.$ Usually it is computed using a QR decomposition, which is very nearly the same thing but calculated in a numerically stable manner.

This construction yields a sequence of additional columns to consider including in the model matrix. Polynomial regression in one variable therefore usually proceeds by adding elements of this sequence one by one, in order, until no further improvement in the regression is obtained. Because each new column is orthogonal to the previous ones, including it does not change any of the previous coefficient estimates. This makes for an efficient and readily interpretable procedure.
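The construction can be checked with a short sketch on simulated data (the data and the chosen degrees are assumptions for illustration). It builds the orthonormal sequence from a QR decomposition of the powers of $X$, compares it with the output of `poly`, and confirms that adding a higher-degree column leaves the earlier coefficient estimates unchanged.

```r
set.seed(7)
x <- runif(30)                           # simulated data, illustration only
y <- 1 + 2 * x - 3 * x^2 + rnorm(30, sd = 0.1)

# Gram-Schmidt via QR on the powers X^0, X^1, X^2, X^3
powers <- outer(x, 0:3, "^")
Q <- qr.Q(qr(powers))                    # orthonormal columns, one per degree

# poly() omits the constant column; its columns agree with Q up to sign
P <- poly(x, 3)
agreement <- abs(diag(crossprod(Q[, 2:4], P)))   # each entry should be 1

# Adding the cubic term does not change the lower-order coefficients
coef2 <- coef(lm(y ~ poly(x, 2)))
coef3 <- coef(lm(y ~ poly(x, 3)))
```

Comparing `coef2` with the first three entries of `coef3` shows the same values, which is what makes the one-by-one procedure cheap to carry out.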

Polynomials in multiple variables

Exploratory regression (as well as model fitting) usually proceeds by first considering which (original) variables to include in a model; then assessing whether those variables could be augmented by including various transformations of them, such as monomials; and then introducing "interactions" formed from products of these variables and their re-expressions.

Carrying out such a program, then, would start with forming univariate orthogonal polynomials in the columns of $\mathbb X$ separately. After selecting a suitable degree for each column, you would then introduce interactions.

At this point, parts of the univariate program break down. What sequence of interactions would you apply, one by one, until a suitable model is identified? Moreover, now that we have truly entered the realm of multivariable analysis, the numbers of options available and their growing complexity suggest there may be diminishing returns in constructing a sequence of multivariate orthogonal polynomials. If, however, you had such a sequence in mind, you could compute it using a QR decomposition.

What `R` does

Software for polynomial regression therefore tends to focus on computing univariate orthogonal polynomial sequences. It is characteristic for R to extend such support as automatically as possible to groups of univariate polynomials. This is what `poly` does. (Its companion `polym` is essentially the same code, with fewer bells and whistles; the two functions do the same things.)

Specifically, poly will compute a sequence of univariate orthogonal polynomials when given a single vector $X,$ stopping at a specified degree $d.$ (If $d$ is too large--and it can be difficult to predict how large is too large--it unfortunately throws an error.) When given a set of vectors $X_1, \ldots, X_k$ in the form of a matrix $\mathbb X,$ it will return

1. Sequences of orthonormal polynomials $p_1(X_j), p_2(X_j), \ldots, p_d(X_j)$ for each $j$ out to a requested maximum degree $d.$ (Since the constant vector $p_0(X_j)$ is common to all variables and is so simple--it's usually accommodated by the intercept in the regression--R does not bother to include it.)
2. All interactions among those orthogonal polynomials up to and including those of degree $d.$

Step (2) involves several subtleties. Usually by an "interaction" among variables we mean "all possible products," but some of those possible products will have degrees greater than $d.$ For instance, with $2$ variables and $d=2,$ R computes

$$p_1(X_1),\quad p_2(X_1),\quad p_1(X_2),\quad p_1(X_1)p_1(X_2),\quad p_2(X_2).$$

R does not include the higher-degree interactions $p_2(X_1)p_1(X_2),$ $p_1(X_1)p_2(X_2)$ (polynomials of degree 3) or $p_2(X_1)p_2(X_2)$ (a polynomial of degree 4). (This is not a serious limitation because you can readily compute these products yourself or specify them in a regression formula object.)

Another subtlety is that no kind of normalization is applied to any of the multivariate products. In the example, the only such product is $p_1(X_1)p_1(X_2).$ However, there is no guarantee even that its mean will be zero and it almost surely will not have unit norm. In this sense it is a true "interaction" between $p_1(X_1)$ and $p_1(X_2)$ and as such can be interpreted as interactions usually are in a regression model.

An example

Let's look at an example. I have randomly generated a matrix $$\mathbb{X} = \pmatrix{1 & 3 \\ 5 & 6 \\ 2 & 4}.$$ To make the calculations easier to follow, everything is rounded to two significant figures for display.

The orthonormal polynomial sequence for the first column $X_1 = (1,5,2)^\prime$ begins by normalizing $X_1^0= (1,1,1)^\prime$ to unit length, giving $p_0(X_1) = (1,1,1)^\prime/\sqrt{3} \approx(0.58,0.58,0.58)^\prime.$ The next step includes $X_1^1 = X_1$ itself. To make it orthogonal to $p_0(X_1),$ regress $X_1$ against $p_0(X_1)$ and set $p_1(X_1)$ equal to the residuals of that regression, rescaled to have unit length. The result is proportional to the usual standardization of $X_1$ (recentering it and dividing by its standard deviation), differing from it only by the factor $\sqrt{n-1}$: $p_1(X_1) = (-0.57,0.79,-0.23)^\prime.$ Finally, $X_1^2 = (1,25,4)^\prime$ is regressed against $p_0(X_1)$ and $p_1(X_1)$ and those residuals are rescaled to unit length. We cannot go any further because the powers of $X_1$ cannot generate a vector space of more than $n=3$ dimensions. (We got this far because the minimal polynomial of the coefficients of $X_1,$ namely $(t-1)(t-5)(t-2),$ has degree $3,$ demonstrating that all monomials of degree $3$ or larger are linear combinations of lower powers and those lower powers are linearly independent.)

The resulting matrix representing an orthonormal polynomial sequence for $X_1$ is

$$\mathbb{P_1} = \pmatrix{0.58 & -0.57 & 0.59 \\ 0.58 & 0.79 & 0.20 \\ 0.58 & -0.23 & -0.78}$$

(to two significant figures).

In the same fashion, an orthonormal polynomial matrix for $X_2$ is

$$\mathbb{P_2} = \pmatrix{0.58 & -0.62 & 0.53 \\ 0.58 & 0.77 & 0.27 \\ 0.58 & -0.15 & -0.80}.$$

The interaction term is the product of the middle columns of these matrices, equal to $(0.35, 0.61, 0.035)^\prime.$ The full matrix created by poly or polym, then, is

$$\mathbb{P} = \pmatrix{-0.57 & 0.59 & -0.62 & 0.35 & 0.53 \\ 0.79 & 0.20&0.77& 0.61& 0.27 \\ -0.23 & -0.78 & -0.15 & 0.035 & -0.80}.$$

Notice the sequence in which the columns are laid out: the non-constant orthonormal polynomials for $X_1$ are in columns 1 and 2 while those for $X_2$ are in columns 3 and 5. Thus, the only orthogonality that is guaranteed in this output is within each of these two pairs of columns. This is reflected in the calculation of $\mathbb{P}^\prime\mathbb{P},$ which will have zeros in positions $(1,2), (2,1), (3,5),$ and $(5,3)$ (shown in red below), but may be nonzero anywhere else, and will have ones in positions $(1,1), (2,2), (3,3),$ and $(5,5)$ (shown in blue below), but is likely not to have a one in the other diagonal positions ($(4,4)$ in this example). Indeed,

$$\mathbb{P}^\prime\,\mathbb{P} = \pmatrix{\color{blue}{\bf 1} & \color{red}{\bf 0} & 1 & 0.28 & 0.091 \\ \color{red}{\bf 0} & \color{blue}{\bf 1} & -0.091 & 0.3 & 1 \\ 1 & -0.091 & \color{blue}{\bf 1} & 0.25 & \color{red}{\bf 0} \\ 0.28 & 0.3 & 0.25 & 0.5 & 0.32 \\ 0.091 & 1 & \color{red}{\bf 0} & 0.32 & \color{blue}{\bf 1}}.$$

When you inspect the $\mathbb P$ matrix shown in the question, and recognize that multiples of $10^{-17}$ are really zeros, you will observe that this pattern of zeros in the red positions holds. This is the sense in which those bivariate polynomials are "orthogonal."
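The example can be reproduced with a few lines of R. This is a sketch only: the signs of individual columns may differ from those displayed above, which changes the sign of some off-diagonal entries but not the pattern of zeros and ones.

```r
X <- cbind(c(1, 5, 2), c(3, 6, 4))   # the matrix from the example

P <- poly(X, degree = 2)             # five columns, in the order discussed above
G <- crossprod(unclass(P))           # the matrix P'P
round(G, 2)
```

Positions $(1,2)$ and $(3,5)$ of `G` are zero up to floating-point error, while the diagonal entry for the interaction column, $(4,4),$ is about $0.5$ rather than $1.$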
