Solved – Can you add polynomial terms to multiple linear regression

multiple regressionpolynomialregression

I am a little confused about when you should or shouldn't add polynomial terms to a multiple linear regression model. I know polynomials are used to capture the curvature in the data, but it always seems to be in the form of:

$y = x_1 + x_2 + x_1^2 + x_2^2 + x_1x_2 + c$

What if you know that there is a linear relationship between $y$ and $x_1$, but a non-linear relationship between $y$ and $x_2$? Can you use a model in the form of:

$y = x_1 + x_2 + x_2^2 + c$

I guess my question is, is it valid to drop the $x_1^2$ term and the $x_1x_2$ term, or do you have to follow the generic form of a polynomial regression model?

Best Answer

In addition to @mkt's excellent answer, I thought I would provide a specific example for you to see so that you can develop some intuition.

Generate Data for Example

For this example, I generated some data using R as follows:

set.seed(124)

n <- 200
x1 <- rnorm(n, mean=0, sd=0.2)
x2 <- rnorm(n, mean=0, sd=0.5)

eps <- rnorm(n, mean=0, sd=1)

y = 1 + 10*x1 + 0.4*x2 + 0.8*x2^2 + eps

As you can see from the above, the data come from the model $y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 + \beta_3*x_2^2 + \epsilon$, where $\epsilon$ is a normally distributed random error term with mean $0$ and unknown variance $\sigma^2$. Furthermore, $\beta_0 = 1$, $\beta_1 = 10$, $\beta_2 = 0.4$ and $\beta_3 = 0.8$, while $\sigma = 1$.

Visualize the Generated Data via Coplots

Given the simulated data on the outcome variable y and the predictor variables x1 and x2, we can visualize these data using coplots:

library(lattice)

coplot(y ~ x1 | x2,  
       number = 4, rows = 1,
       panel = panel.smooth)

coplot(y ~ x2 | x1,  
       number = 4, rows = 1,
       panel = panel.smooth)

The resulting coplots are shown below.

The first coplot shows scatterplots of y versus x1 when x2 belongs to four different ranges of observed values (which are overlapping) and enhances each of these scatterplots with a smooth, possibly non-linear fit whose shape is estimated from the data.

enter image description here

The second coplot shows scatterplots of y versus x2 when x1 belongs to four different ranges of observed values (which are overlapping) and enhances each of these scatterplots with a smooth fit.

enter image description here

The first coplot suggests that it is reasonable to assume that x1 has a linear effect on y when controlling for x2 and that this effect does not depend on x2.

The second coplot suggests that it is reasonable to assume that x2 has a quadratic effect on y when controlling for x1 and that this effect does not depend on x1.

Fit a Correctly Specified Model

The coplots suggest fitting the following model to the data, which allows for a linear effect of x1 and a quadratic effect of x2:

m <- lm(y ~ x1 + x2 + I(x2^2))  

Construct Component Plus Residual Plots for the Correctly Specified Model

Once the correctly specified model is fitted to the data, we can examine component plus residual plots for each predictor included in the model:

library(car)

crPlots(m)

These component plus residual plots are shown below and suggest that the model was correctly specified since they display no evidence of nonlinearity, etc. Indeed, in each of these plots, there is no obvious discrepancy between the dotted blue line suggestive of a linear effect of the corresponding predictor, and the solid magenta line suggestive of a non-linear effect of that predictor in the model.

enter image description here

Fit an Incorrectly Specified Model

Let's play the devil's advocate and say that our lm() model was in fact incorrectly specified (i.e., misspecified), in the sense that it omitted the quadratic term I(x2^2):

m.mis <-  lm(y ~ x1 + x2)

Construct Component Plus Residual Plots for the Incorrectly Specified Model

If we were to construct component plus residual plots for the misspecified model, we would immediately see a suggestion of non-linearity of the effect of x2 in the misspecified model:

crPlots(m.mis)

In other words, as seen below, the misspecified model failed to capture the quadratic effect of x2 and this effect shows up in the component plus residual plot corresponding to the predictor x2 in the misspecified model.

enter image description here

The misspecification of the effect of x2 in the model m.mis would also be apparent when examining plots of the residuals associated with this model against each of the predictors x1 and x2:

par(mfrow=c(1,2))
plot(residuals(m.mis) ~ x1, pch=20, col="darkred")
abline(h=0, lty=2, col="blue", lwd=2)
plot(residuals(m.mis) ~ x2, pch=20, col="darkred")
abline(h=0, lty=2, col="blue", lwd=2)

As seen below, the plot of residuals associated with m.mis versus x2 exhibits a clear quadratic pattern, suggesting that the model m.mis failed to capture this systematic pattern.

enter image description here

Augment the Incorrectly Specified Model

To correctly specify the model m.mis, we would need to augment it so that it also includes the term I(x2^2):

m <- lm(y ~ x1 + x2 + I(x2^2)) 

Here are the plots of the residuals versus x1 and x2 for this correctly specified model:

par(mfrow=c(1,2))
plot(residuals(m) ~ x1, pch=20, col="darkred")
abline(h=0, lty=2, col="blue", lwd=2)
plot(residuals(m) ~ x2, pch=20, col="darkred")
abline(h=0, lty=2, col="blue", lwd=2)

Notice that the quadratic pattern previously seen in the plot of residuals versus x2 for the misspecified model m.mis has now disappeared from the plot of residuals versus x2 for the correctly specified model m.

Note that the vertical axis of all the plots of residuals versus x1 and x2 shown here should be labelled as "Residual". For some reason, R Studio cuts that label off.

enter image description here

Related Question