Solved – What are multivariate orthogonal polynomials as computed in R

Tags: multiple regression, orthogonal, polynomial, r

Orthogonal polynomials on a univariate set of points are polynomials that produce values at those points in such a way that their pairwise dot products and correlations are zero. R can produce orthogonal polynomials with the function poly.

The same function has a variant, polym, that produces orthogonal polynomials on a multivariate set of points. However, the resulting polynomials are not orthogonal in the sense of having pairwise zero correlation. In fact, since the first-order polynomials are supposed to be just the original variables, the first-order polynomials won't be orthogonal unless the original variables are uncorrelated.

Then, my questions are:

  • What are the multivariate orthogonal polynomials computed by polym in R? Are they just the product of the univariate orthogonal polynomials? What are they used for?
  • Can true multivariate orthogonal polynomials exist? Is there an easy way to produce them, in R? Are they actually used in regression?

Update

In response to Superpronker's comment, I give one example of what I mean by uncorrelated polynomials:

> x<-rnorm(10000)
> cor(cbind(poly(x,degree=3)))
              1             2             3
1  1.000000e+00 -6.809725e-17  2.253577e-18
2 -6.809725e-17  1.000000e+00 -2.765115e-17
3  2.253577e-18 -2.765115e-17  1.000000e+00

The poly function returns the orthogonal polynomials evaluated at the points x (here 10,000 points for each polynomial). The correlation between values of different polynomials is zero (up to numerical error).

When using multivariate polynomials, the correlations are different from zero:

> x<-rnorm(1000)
> y<-rnorm(1000)
> cor(cbind(polym(x,y,degree=2)))
              1.0           2.0           0.1         1.1           0.2
1.0  1.000000e+00  2.351107e-17  2.803716e-02 -0.02838553  3.802363e-02
2.0  2.351107e-17  1.000000e+00 -1.899282e-02  0.10336693 -8.205039e-04
0.1  2.803716e-02 -1.899282e-02  1.000000e+00  0.05426440  5.974827e-17
1.1 -2.838553e-02  1.033669e-01  5.426440e-02  1.00000000  8.415630e-02
0.2  3.802363e-02 -8.205039e-04  5.974827e-17  0.08415630  1.000000e+00

Therefore, I don't understand in what sense those bivariate polynomials are orthogonal.

Update 2

I want to clarify the meaning of "orthogonal polynomials" used in regression, because this context can be somewhat misleading when applying ideas from orthogonal polynomials on connected intervals, as in Superpronker's last comment.

I quote Julian J. Faraway's Practical Regression and Anova using R pages 101 and 102:

Orthogonal polynomials get round this problem by defining
$$z_1=a_1+b_1x$$
$$z_2= a_2+b_2x+c_2x^2$$
$$z_3= a_3+b_3x+c_3x^2+d_3x^3$$
etc. where the coefficients a, b, c… are chosen so that $z_i^T z_j=0$ when $i \neq j$. The z are called orthogonal polynomials.

By a slight abuse of language, here the author uses $z_i$ both for the polynomial (as a function) and for the vector of values the polynomial takes at the points of the set $x$. Or maybe it isn't an abuse of language at all, because from the beginning of the book $x$ has been the predictor (i.e. the set of values taken by the predictor).

This meaning of orthogonal polynomials isn't actually different from that of orthogonal polynomials on an interval. We can define orthogonal polynomials in the usual way (using integrals) over any measurable set with any measure. Here we have a finite set ($x$) and we are using a dot product instead of an integral, but these are still orthogonal polynomials if we take our measure to be a sum of Dirac deltas at the points of our finite set.

And in relation to correlation: orthogonal polynomials on a finite set give orthogonal vectors in $\mathbb{R}^n$ (their values at those points). If the dot product of two such (centered) vectors is zero, their covariance is zero, and if the covariance is zero the correlation is zero. In the context of linear models it is very useful to relate "orthogonal" and "uncorrelated", as in "orthogonal design of experiments".
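
Sticking with the univariate poly example above, all three steps of that chain can be checked at once (a small sketch; zapsmall just rounds the ~1e-17 numerical noise to exact zeros):

set.seed(1)
x <- rnorm(10000)
P <- poly(x, degree = 3)

colMeans(P)             # every column has (numerically) zero mean
zapsmall(crossprod(P))  # pairwise dot products are zero (and each column has unit norm)
zapsmall(cor(cbind(P))) # hence zero covariance and zero correlation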

Best Answer

Let's explore what's happening. I'm sure you already know most of the following material, but to establish notation and definitions and to make the ideas clear, I will cover the basics of polynomial regression before answering the question. If you like, jump to the heading "What R does" about two-thirds of the way into this post, and then skip back for any definitions you might need.

The setting

We are considering an $n\times k$ model matrix $\mathbb X$ of potential explanatory variables in some kind of regression. This means we're thinking of the columns of $\mathbb X$ as being $n$-vectors $X_1, X_2, \ldots, X_k$ and we will be forming linear combinations of them, $\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k,$ to predict or estimate a response.

Sometimes a regression can be improved by introducing additional columns created by multiplying various columns of $X$ by each other, coefficient by coefficient. Such products are called "monomials" and can be written like

$$X_1^{d_1} X_2^{d_2} \cdots X_k^{d_k}$$

where each "power" $d_i$ is zero or greater, representing how many times each $X_1$ appears in the product. Notice that $X^0$ is an $n$-vector of constant coefficients ($1$) and $X^1=X$ itself. Thus, monomials (as vectors) generate a vector space that includes the original column space of $\mathbb X.$ The possibility that it might be a larger vector space gives this procedure greater scope to model the response with linear combinations.

We intend to replace the original model matrix $\mathbb X$ by a collection of linear combinations of monomials. When the degree of at least one of these monomials exceeds $1,$ this is called polynomial regression.
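
For concreteness, here is the full set of monomial columns of degree up to $2$ for two hypothetical predictors, built by hand (in the univariate case this is what poly's raw = TRUE argument returns):

set.seed(1)
X1 <- rnorm(5); X2 <- rnorm(5)

# all monomials of total degree <= 2 in X1 and X2 (the constant column is left implicit)
M <- cbind(X1, "X1^2" = X1^2, X2, "X1*X2" = X1 * X2, "X2^2" = X2^2)
M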

Gradings of polynomials

The degree of a monomial is the sum of its powers, $d_1+d_2+\ldots+d_k.$ The degree of a linear combination of monomials (a "polynomial") is the largest degree among the monomial terms with nonzero coefficients. The degree has an intrinsic meaning, because when you change the basis of the original vector space, each vector $X_i$ is newly represented by a linear combination of all the vectors; monomials $X_1^{d_1} X_2^{d_2} \cdots X_k^{d_k}$ thereby become polynomials of the same degree; and consequently the degree of any polynomial is unchanged.

The degree provides a natural "grading" to this polynomial algebra: the vector space generated by all linear combinations of monomials in $X$ of degree up to and including $d+1,$ called the "polynomials of [or up to] degree $d+1$ in $X,$" extends the vector space of polynomials up to degree $d$ in $X.$

Uses of polynomial regression

Often, polynomial regression is exploratory in the sense that we don't know at the outset which monomials to include. The process of creating new model matrices out of monomials and re-fitting the regression may need to be repeated many times, perhaps an astronomical number of times in some machine learning settings.

The chief problems with this approach are

  1. Monomials often introduce problematic amounts of "multicollinearity" in the new model matrix, primarily because powers of a single variable tend to be highly collinear. (Collinearity among powers of two different variables depends on how those variables are related, and is therefore less predictable; see the illustration just after this list.)

  2. Changing just a single column of the model matrix, or introducing a new one, or deleting one, may require a "cold restart" of the regression procedure, potentially taking a long time for computation.
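
A minimal illustration of problem 1, using a strictly positive predictor (where the collinearity of raw powers is most severe) and, for contrast, the orthogonal polynomials introduced below:

set.seed(1)
x <- runif(100, 1, 2)              # a strictly positive predictor

round(cor(cbind(x, x^2, x^3)), 3)  # raw powers: pairwise correlations near 1
round(cor(cbind(poly(x, 3))), 3)   # orthogonal polynomials: the identity matrix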

The gradings of polynomial algebras provide a way to overcome both problems.

Orthogonal polynomials in one variable

Given a single column vector $X,$ a set of "orthogonal polynomials" for $X$ is a sequence of column vectors $p_0(X), p_1(X), p_2(X),\ldots$ formed as linear combinations of monomials in $X$ alone--that is, as powers of $X$--with the following properties:

  1. For each degree $d=0, 1, 2, \ldots, $ the vectors $p_0(X), p_1(X), \ldots, p_d(X)$ generate the same vector space as $X^0, X^1, \ldots, X^d.$ (Notice that $X^0$ is the $n$-vector of ones and $X^1$ is just $X$ itself.)

  2. The $p_i(X)$ are mutually orthogonal in the sense that for $i\ne j,$ $$p_i(X)^\prime p_j(X) = 0.$$

Usually, the replacement model matrix $$\mathbb{P} = \pmatrix{p_0(X) & p_1(X) & \cdots & p_d(X)}$$ formed from these polynomials is chosen to be orthonormal by normalizing its columns to unit length: $$\mathbb{P}^\prime \mathbb{P} = \mathbb{I}_{d+1}.$$ Because the inverse of $\mathbb{P}^\prime \mathbb{P}$ appears in most regression equations and the inverse of the identity matrix $\mathbb{I}_{d+1}$ is itself, this represents a huge computational gain.

Orthonormality very nearly determines the $p_i(X).$ You can see this by construction:

  • The first polynomial, $p_0(X),$ must be a multiple of the $n$-vector $\mathbf{1}=(1,1,\ldots,1)^\prime$ of unit length. There are only two choices, $\pm \sqrt{1/n}\mathbf{1}.$ It is customary to pick the positive square root.

  • The second polynomial, $p_1(X),$ must be orthogonal to $\mathbf{1}.$ It can be obtained by regressing $X$ against $\mathbf{1},$ whose solution is the vector of mean values $\hat X = \bar{X}\mathbf{1}.$ If the residuals $\epsilon = X - \hat X$ are not identically zero, they give the only two possible solutions $p_1(X) = \pm \left(1/||\epsilon||\right)\,\epsilon.$

...

  • Generally, $p_{d+1}(X)$ is obtained by regressing $X^{d+1}$ against $p_0(X), p_1(X), \ldots, p_d(X)$ and rescaling the residuals to be a vector of unit length. There are two choices of sign when the residuals are not all zero. Otherwise, the process ends: it will be fruitless to look at any higher powers of $X.$ (This is a nice theorem but its proof need not distract us here.)

This is the Gram-Schmidt process applied to the intrinsic sequence of vectors $X^0, X^1, \ldots, X^d, \ldots.$ Usually it is computed using a QR decomposition, which is very nearly the same thing but calculated in a numerically stable manner.
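
Here is a sketch of that construction carried out literally as described, by repeated regression on the previously constructed columns; up to column signs it reproduces what poly computes:

set.seed(1)
x <- rnorm(20)
d <- 3

P <- matrix(1 / sqrt(length(x)), nrow = length(x), ncol = 1)  # p_0(X)
for (j in seq_len(d)) {
  r <- residuals(lm(x^j ~ P - 1))    # regress X^j on p_0(X), ..., p_{j-1}(X)
  P <- cbind(P, r / sqrt(sum(r^2)))  # rescale the residuals to unit length
}

zapsmall(crossprod(P))                    # orthonormal: P'P is the identity
max(abs(abs(P[, -1]) - abs(poly(x, d))))  # agrees with poly up to sign (essentially zero)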

This construction yields a sequence of additional columns to consider including in the model matrix. Polynomial regression in one variable therefore usually proceeds by adding elements of this sequence one by one, in order, until no further improvement in the regression is obtained. Because each new column is orthogonal to the previous ones, including it does not change any of the previous coefficient estimates. This makes for an efficient and readily interpretable procedure.
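
To see this stability concretely, compare the coefficients of two nested fits on simulated data; adding the cubic term leaves the intercept and the lower-degree estimates unchanged:

set.seed(1)
x <- rnorm(100)
y <- 1 + x - 0.5 * x^2 + rnorm(100)

coef(lm(y ~ poly(x, 2)))
coef(lm(y ~ poly(x, 3)))  # the first three coefficients are identical to those above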

Polynomials in multiple variables

Exploratory regression (as well as model fitting) usually proceeds by first considering which (original) variables to include in a model; then assessing whether those variables could be augmented by including various transformations of them, such as monomials; and then introducing "interactions" formed from products of these variables and their re-expressions.

Carrying out such a program, then, would start with forming univariate orthogonal polynomials in the columns of $\mathbb X$ separately. After selecting a suitable degree for each column, you would then introduce interactions.

At this point, parts of the univariate program break down. What sequence of interactions would you apply, one by one, until a suitable model is identified? Moreover, now that we have truly entered the realm of multivariable analysis, the numbers of options available and their growing complexity suggest there may be diminishing returns in constructing a sequence of multivariate orthogonal polynomials. If, however, you had such a sequence in mind, you could compute it using a QR decomposition.
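
For illustration, here is one way such a fully orthogonal multivariate sequence could be computed once an ordering of the monomials has been chosen: apply a QR decomposition to the raw monomial matrix (a sketch of the idea only; this is not what polym does):

set.seed(1)
x1 <- rnorm(1000); x2 <- rnorm(1000)

# raw monomials up to total degree 2, in the same order polym uses
M <- cbind(x1, x1^2, x2, x1 * x2, x2^2)
Q <- qr.Q(qr(cbind(1, M)))[, -1]   # orthonormalize, then drop the constant column
colnames(Q) <- c("1.0", "2.0", "0.1", "1.1", "0.2")

zapsmall(crossprod(Q))  # the identity: a genuinely orthogonal (indeed orthonormal) basis
zapsmall(cor(Q))        # also uncorrelated, because every column is orthogonal to the constant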


What R does

Software for polynomial regression therefore tends to focus on computing univariate orthogonal polynomial sequences. It is characteristic for R to extend such support as automatically as possible to groups of univariate polynomials. This is what poly does. (Its companion polym is essentially the same code, with fewer bells and whistles; the two functions do the same things.)

Specifically, poly will compute a sequence of univariate orthogonal polynomials when given a single vector $X,$ stopping at a specified degree $d.$ (If $d$ is too large--and it can be difficult to predict how large is too large--it unfortunately throws an error.) When given a set of vectors $X_1, \ldots, X_k$ in the form of a matrix $\mathbb X,$ it will return

  1. Sequences of orthonormal polynomials $p_1(X_j), p_2(X_j), \ldots, p_d(X_j)$ for each $j$ out to a requested maximum degree $d.$ (Since the constant vector $p_0(X_j)$ is common to all variables and is so simple--it's usually accommodated by the intercept in the regression--R does not bother to include it.)

  2. All interactions among those orthogonal polynomials up to and including those of degree $d.$

Step (2) involves several subtleties. Usually by an "interaction" among variables we mean "all possible products," but some of those possible products will have degrees greater than $d.$ For instance, with $2$ variables and $d=2,$ R computes

$$p_1(X_1),\quad p_2(X_1),\quad p_1(X_2),\quad p_1(X_1)p_1(X_2),\quad p_2(X_2).$$

R does not include the higher-degree interactions $p_2(X_1)p_1(X_2)$ and $p_1(X_1)p_2(X_2)$ (polynomials of degree 3) or $p_2(X_1)p_2(X_2)$ (a polynomial of degree 4). (This is not a serious limitation because you can readily compute these products yourself or specify them in a regression formula object.)
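
The ordering and the omissions are visible directly in the column names (a quick check; the name "i.j" corresponds to $p_i(X_1)\,p_j(X_2),$ with $p_0$ understood as the constant), and an omitted product is easy to build by hand:

set.seed(1)
x <- rnorm(10); y <- rnorm(10)
P <- polym(x, y, degree = 2)

colnames(P)                     # "1.0" "2.0" "0.1" "1.1" "0.2" -- no "2.1", "1.2" or "2.2"
p21 <- P[, "2.0"] * P[, "0.1"]  # the omitted degree-3 interaction p2(X1) p1(X2)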

Another subtlety is that no kind of normalization is applied to any of the multivariate products. In the example, the only such product is $p_1(X_1)p_1(X_2).$ However, there is no guarantee even that its mean will be zero and it almost surely will not have unit norm. In this sense it is a true "interaction" between $p_1(X_1)$ and $p_1(X_2)$ and as such can be interpreted as interactions usually are in a regression model.
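
This is easy to confirm numerically (a small check with simulated data):

set.seed(1)
x <- rnorm(10); y <- rnorm(10)
int <- polym(x, y, degree = 2)[, "1.1"]  # the interaction column p1(X1) p1(X2)

mean(int)    # generally not zero
sum(int^2)   # generally not one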

An example

Let's look at an example. I have randomly generated a matrix $$\mathbb{X} = \pmatrix{1 & 3 \\ 5 & 6 \\ 2 & 4}.$$ To make the calculations easier to follow, everything is rounded to two significant figures for display.

The orthonormal polynomial sequence for the first column $X_1 = (1,5,2)^\prime$ begins by normalizing $X_1^0= (1,1,1)^\prime$ to unit length, giving $p_0(X_1) = (1,1,1)^\prime/\sqrt{3} \approx (0.58,0.58,0.58)^\prime.$ The next step includes $X_1^1 = X_1$ itself. To make it orthogonal to $p_0(X_1),$ regress $X_1$ against $p_0(X_1)$ and set $p_1(X_1)$ equal to the residuals of that regression, rescaled to have unit length. The result is the usual standardization of $X_1$ obtained by recentering it and dividing by its standard deviation, $p_1(X_1) = (-0.57,0.79,-0.23)^\prime.$ Finally, $X_1^2 = (1,25,4)^\prime$ is regressed against $p_0(X_1)$ and $p_1(X_1)$ and those residuals are rescaled to unit length. We cannot go any further because the powers of $X_1$ cannot generate a vector space of more than $n=3$ dimensions. (We got this far because the minimal polynomial of the coefficients of $X_1,$ namely $(t-1)(t-5)(t-2),$ has degree $3,$ demonstrating that all monomials of degree $3$ or larger are linear combinations of lower powers and those lower powers are linearly independent.)

The resulting matrix representing an orthonormal polynomial sequence for $X_1$ is

$$\mathbb{P_1} = \pmatrix{0.58 & -0.57 & 0.59 \\ 0.58 & 0.79 & 0.20 \\ 0.58 & -0.23 & -0.78}$$

(to two significant figures).
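
The same matrix can be obtained directly from poly (a check of the hand calculation; column signs may differ depending on the platform's QR routine):

X1 <- c(1, 5, 2)
P1 <- cbind(p0 = 1 / sqrt(3), poly(X1, degree = 2))
round(P1, 2)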

In the same fashion, an orthonormal polynomial matrix for $X_2$ is

$$\mathbb{P_2} = \pmatrix{0.58 & -0.62 & 0.53 \\ 0.58 & 0.77 & 0.27 \\ 0.58 & -0.15 & -0.80}.$$

The interaction term is the product of the middle columns of these matrices, equal to $(0.35, 0.61, 0.035)^\prime.$ The full matrix created by poly or polym, then, is

$$\mathbb{P} = \pmatrix{-0.57 & 0.59 & -0.62 & 0.35 & 0.53 \\ 0.79 & 0.20&0.77& 0.61& 0.27 \\ -0.23 & -0.78 & -0.15 & 0.035 & -0.80}.$$

Notice the sequence in which the columns are laid out: the non-constant orthonormal polynomials for $X_1$ are in columns 1 and 2 while those for $X_2$ are in columns 3 and 5. Thus, the only orthogonality that is guaranteed in this output is between these two pairs of columns. This is reflected in the calculation of $\mathbb{P}^\prime\mathbb{P},$ which will have zeros in positions $(1,2), (2,1), (3,5),$ and $(5,3)$ (shown in red below) but may be nonzero anywhere else, and will have ones in positions $(1,1), (2,2), (3,3),$ and $(5,5)$ (shown in blue below) but is likely not to have a one in the remaining diagonal position ($(4,4)$ in this example). Indeed,

$$\mathbb{P}^\prime\,\mathbb{P} = \pmatrix{\color{blue}{\bf 1} & \color{red}{\bf 0} & 1 & 0.28 & 0.091 \\ \color{red}{\bf 0} & \color{blue}{\bf 1} & -0.091 & 0.3 & 1 \\ 1 & -0.091 & \color{blue}{\bf 1} & 0.25 & \color{red}{\bf 0} \\ 0.28 & 0.3 & 0.25 & 0.5 & 0.32 \\ 0.091 & 1 & \color{red}{\bf 0} & 0.32 & \color{blue}{\bf 1}}.$$

When you inspect the correlation matrix shown in the question, and recognize that multiples of $10^{-17}$ are really zeros, you will observe that this pattern of zeros in the red positions holds. This is the sense in which those bivariate polynomials are "orthogonal."
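
Finally, the whole pattern can be reproduced from the $3\times 2$ example matrix itself (a small check; zapsmall clears the ~$10^{-17}$ entries, and the result agrees with the displayed $\mathbb{P}^\prime\mathbb{P}$ up to rounding and column signs):

X1 <- c(1, 5, 2)
X2 <- c(3, 6, 4)
P <- polym(X1, X2, degree = 2)

round(zapsmall(crossprod(P)), 2)  # zeros exactly in the "red" positions, ones in the "blue"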